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1  Statement  of  Work 

Short-  and  Long-term  Effects  in  Prostate  Cancer  Survival:  Analysis  of  Treatment  Efficacy  and 
Risk  Prediction 
Alexander  Tsodikov,  Ph.D. 

There  has  been  no  change  in  the  statement  of  work.  A  breakdown  below  shows  what  has  been 
accomplished  in  the  first  8  months  of  the  project  at  the  University  of  Utah  and  the  portion  of  the 
project  to  be  completed  at  the  University  of  California  at  Davis. 

Tasks  accomplished  in  the  first  8  Months  of  the  project  at  the  University  of  Utah 

Task  1.  Develop  model-building  techniques  (Months  1-4) 

Task  2A.  Develop  estimation  and  hypothesis  testing  (Months  5-8) 

(a)  Develop  point  estimation 

(b)  Develop  simulation  algorithms 

(c)  Develop  hypothesis  testing 

Tasks  to  be  completed  at  the  University  of  California  at  Davis 

Task  2B.  Develop  estimation  and  hypothesis  testing  (Months  9-10) 

(d)  Develop  software  implementation 

(e)  Study  models  and  methods  by  simulation 

Task  3.  Develop  variable  selection  procedures  (Months  11-13) 

Task  4.  Analyze  data  for  significant  effects  (Months  14-18) 

(a)  Apply  estimation,  hypothesis  testing  and  variable  selection  to  MSKCC  data  and  SEER 
data. 

(b)  Identify  a  model  for  prostate  cancer  biochemical  recurrence,  prostate  cancer  specific  sur¬ 
vival,  and  overall  survival  using  methodology  and  software  developed  in  Tasks  1-4. 

Task  5.  Computer-intensive  approaches  to  prognosis  and  validation  (Months  19-24) 


Alex  Tsodikov,  Ph.D.  Date:  October  31,  2003. 
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2  Objectives 

There  has  been  no  change  in  the  project  objectives.  The  specific  aims  of  this  project  are 

1.  To  provide  a  statistical  model  that  reproduces  the  complex  survival  responses  in  prostate  cancer. 

2.  To  develop  methodology  for  analysis  of  prognosis  after  treatment  for  prostate  cancer  taking  into 
account  the  long-  and  short-term  effects  of  prognostic  factors  and  treatment. 

3.  To  develop  statistical  software  implementing  model-building,  estimation,  construction  of  prog¬ 
nostic  indices,  conditional  survival  prognosis,  and  assessment  of  the  quality  of  prognostic  clas¬ 
sifications  based  on  the  new  models. 

4.  To  apply  the  models  and  methodology  to  analyze  post-treatment  survival  of  patients  with 
prostate  cancer  using  data  from  the  Memorial  Sloan  Kettering  Cancer  Center,  the  Utah  Cancer 
Registry  and  the  SEER  database. 


3  Introduction 

The  goal  of  this  proposal  is  to  investigate  a  novel  approach  to  the  analysis  of  post-treatment 
survival  of  prostate  cancer  patients:  the  decomposition  of  the  diversity  of  survival  patterns  into 
short-term  and  long-term  effects.  We  proposed  to  identify  a  model  of  prostate  cancer  survival  in¬ 
corporating  long-  and  short-term  effects  of  prognostic  factors  and  treatment.  Novel  statistical  tools 
are  being  developed  to  make  such  models  work  for  better  prognosis  of  prostate  cancer  patients. 
During  the  first  8  months  of  the  project  we  focused  on  the  development  of  algorithms  for  model 
building,  estimation,  hypothesis  testing  and  simulation  reviewed  in  the  following  sections  of  this 
progress  report.  Fragments  of  software  implementation  were  also  developed  to  illustrate  the  per¬ 
formance  of  the  novel  statistical  tools  using  Surveillance  Epidemiology  and  End  Results  (SEER) 
survival  data  by  stage.  With  further  developments  in  methodology  and  software  planned  in  the 
project,  data  analysis  will  be  refined  to  include  a  broad  range  of  models,  explanatory  variables 
and  variable  selection  techniques. 


4  Estimation 

4.1  Nonlinear  Transformation  Models  (NTM) 

4.1.1  PH  mixture  model 

For  a  survival  function  G(t  |  f3,  z),  where  (5  are  regression  coefficients,  and  z  are  covariates,  consider 

a  PH  mixture  model  ,  ,a  v  .  . 

G{t\0,z)  =  E{F(tf^\z},  (1) 

where  F  is  the  baseline  survival  function,  and  U(f3>  z)  is  a  nonnegative  random  variable  whose 
distribution  depends  on  covariates  and  regression  coefficients.  This  model  can  be  considered  a 
compact  generalization  of  the  so-called  PH  frailty  model,  or  a  PH  model  with  a  random  effect 

G(t\z,z)  =  E{F(t)e^v},  (2) 
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where  6  is  a  predictor,  and  V  is  a  random  variable  independent  of  the  covariates,  considered 
by  Hougaard  [1984],  Klein  [1992],  Nielsen  et  al.  [1992]  and  many  other  authors,  for  different 
distributions  of  V.  Some  authors  considered  specific  frailty  variables  dependent  on  covariates,  for 
example,  Wassel  and  Moeschberger  [1993],  Clayton  and  Cuzick  [1985].  Obviously,  when  V  in  (2) 
is  allowed  to  depend  on  covariates,  the  model  families  (1)  and  (2)  are  equivalent. 

We  can  make  the  following  important  observations  about  the  class  of  PH  mixture  models  (1): 

•  The  survival  function  (1)  is  built  by  composition 

G(t\/3,z)  =  ('yoF)(t\P>z),  (3) 

where  j(x  \  (3 ,  z)  is  the  moment  generating  function  of  U. 

•  The  moment  generating  function  7(0:  |  •)  is  a  distribution  function  in  x  with  the  support  on 
[0, 1].  If  the  distribution  of  U  is  specified  parametrically,  7  is  a  parametric  regression  model  on 
[0,1]- 

•  The  fact  that  the  range  and  the  support  of  7  are  the  same  allows  one  to  build  compositions 
of  an  arbitrary  number  of  7s.  One  of  the  technical  results  we  obtained  is  that  the  class  of  PH 
mixture  models  is  closed  with  respect  to  such  compositions. 

These  observations  are  used  to  generalize  the  PH  mixture  family  into  the  NTM  family  (Section 
4.1.2)  and  to  develop  the  composition  technique  (Sections  5,  5.2)  compatible  with  the  QEM 
estimation  procedure  (Section  4.2). 

We  established  the  following  key  property  of  the  PH  mixture  model  [Tsodikov,  2003]. 

Proposition  4.1  Suppose,  we  have  an  observation  ( t,z,c )  sampled  from  the  PH  mixture  model 
under  independent  censoring,  where  t  is  an  observed  survival  time  and  c  is  a  censoring  indicator 
(c  =  0  if  t  is  a  censored  survival  time,  and  c  =  1  if  t  is  a  failure).  Then,  under  the  PH  mixture 
model  (1), 

•  the  conditional  expectation  ofU,  given  the  observed  event  ( t,z,c )  is  given  by 

E  {[/(•)  1 1,  •,  c}  =  (0  o  F)(t  |  ;c)  =  G  [. F(t )  |  •, c] , 
where  the  function  ©  is  given  by 

©fcM  =c  +  a;7^(c)()^l))’  (4) 

where  j^(x  ]  •)  =  dcy(x  |  -)/dxc,  c  =  0, 1, . . .,  j^(x  ]  •)  =  j(x  |  •). 

•  The  function  0  [x  |  •,  c]  is  nondecreasing  in  x  for  any  c  =  0, 1. 

The  nondecreasing  character  of  the  function  ©  in  the  above  statement  is  quite  natural.  The 
longer  the  subject  stays  event-free,  the  lower  the  subject’s  posterior  risk,  represented  by  0.  So 
@{F(t)  |  •,  c}  must  be  a  nonincreasing  function  of  t  for  both  failure  (c  =  1)  and  censoring  (c  =  0) 

events.  Since  the  survival  function  F(t)  is  nonincreasing  in  t,  Q(x  \  •,  c)  must  be  nondecreasing  in  x. 

It  is  interesting  to  note  that  the  population  hazard  function  for  a  heterogeneous  population  under 
the  PH  mixture  model  is  expressed  as  A (t  |  z)  —  Q{F(t)  |  •,  0}h(t),  where  h  is  the  hazard  function 
corresponding  to  F.  Even  if  h{t)  is  increasing,  the  observed  population  hazard  function  may  be 
a  decreasing  one  through  the  decreasing  behavior  of  0{F(£)|-, 0}  with  time.  This  observation 
represents  a  selection  effect  of  the  risk  set  becoming  “healthier”  with  time,  as  frail  individuals 
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leave  the  population.  This  effect  was  discovered  and  extensively  studied  in  demography  [Vaupel 
et  al.,  1979]  in  the  context  of  misinterpretation  of  mortality  trends. 

4.1.2  Nonlinear  Transformation  Models 

In  Section  4.1.1  we  considered  semiparametric  survival  models  of  the  form 

G(<|.)  =  E{F(^w|-}  =  (7oF)(i|.), 

where  7  is  a  moment  generating  function  of  a  nonnegative  random  variable  U.  We  also  noticed 
that  j(x  |  •)  is  a  distribution  function  in  x  G  [0, 1]  with  the  range  contained  in  the  same  interval  of 
[0, 1].  This  brings  us  to  the  following  natural  generalization  of  the  PH  mixture  family  of  models. 

Definition  4.1 

Let  7(3:  |  /3,  z)  be  a  parametrically  specified  distribution  function  with  the  x -domain  of  [0, 1].  Let 
F(t )  be  a  nonparametrically  specified  baseline  survival  function.  A  semiparametric  regression  sur¬ 
vival  model  is  called  a  Nonlinear  Transformation  Model  if  its  survival  function  can  be  represented 

on  thp  fnron 

G(t  |/3,z)  =  7  {F(t)  |/3,z}  =  (70  F)(t  |/3,*).  (5) 

Functions  7  will  be  called  NTM-generating  functions. 

The  class  (5)  was  introduced  in  [Tsodikov,  2003],  where  universal  estimation  algorithms  for  the 
NTM  class  were  developed  (see  Section  4.2).  The  key  requirement  that  ensures  monotonicity  and 
convergence  of  the  estimation  algorithms  of  Section  4.2  is  that  of  nondecreasing  0,  where  0  is 
defined  in  (4).  Now  that  we  no  longer  use  the  concept  of  frailty  in  the  definition  of  NTM,  Q(F  |  •,  c) 
becomes  a  surrogate  of  the  posterior  risk  such  that  its  basic  property  of  nondecreasing  Q(x  |  •,  c) 
is  preserved.  For  these  reasons,  we  will  restrict  the  family  of  NTM-generating  functions  to  those 
with  nondecreasing  0. 

The  class  of  NTM  includes  the  class  of  linear  transformation  models  [Cheng  et  al.,  1995,  1997] 

log  v(T\z)  =  -  log  0(z)  +  e,  (6) 

where  T  is  the  failure  time,  e  is  the  random  error  with  the  distribution  p,  and  v  is  some  unspecified 
strictly  increasing  function.  For  the  exponential  predictor  0(/3,  *)  =  exp (/30  +  /3T*),  the  model 
assumes  a  linear  form  in  covariates  and  transformed  response.  The  connection  between  LTM,  the 
PH  model,  the  PO  model,  and  binary  regression  models  was  discussed  in  [Doksum  and  Gasko, 
1990].  After  a  little  algebra,  a  linear  transformation  model  can  be  represented  as  an  NTM  with 
the  NTM-generating  function 

l(x\/3,z)  =p{logd((3,z)  +  q(x)},  (7) 

where  p  is  a  parametrically  specified  tail  function  (= 1-distribution  function),  and  q  is  an  inverse 
tail  function.  It  is  convenient  to  specify  q  as  the  inverse  of  p,  then  0  —  1  corresponds  to  the 
baseline  7(x|-)  =  x. 

Presentation  of  a  semiparametric  model  in  terms  of  an  NTM-generating  function  7  is  not 
unique,  as  there  is  a  number  of  ways  to  represent  an  arbitrary  monotonic  function.  Indeed, 
expression  (7)  suggests  that  a  transformation  p{q(F)},  where  F  is  an  arbitrary  survival  function, 
is  again  an  arbitrary  survival  function.  In  other  words,  the  family  of  functions 

7(3 10  =  (7°P°9)(*I0  (8) 

represents  the  same  semiparametric  model  for  any  fixed  p  and  q  as  defined  above.  The  optimal 
choice  of  p  and  q  to  represent  a  semiparametric  model  in  such  a  way  as  to  achieve  fastest  conver- 
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gence  or  to  satisfy  the  assumption  of  nondecreasing  0  represents  an  interesting  issue  for  future 
research. 

It  should  be  stressed  that  the  issue  of  non-uniqueness  of  the  representation  of  an  NTM  in 
terms  of  an  NTM-generating  function  should  not  be  confused  with  the  issue  of  identiability  of 
the  model.  For  example,  the  Cox  model  can  be  represented  as  y(x\6(z))  =  xe^  (proper  form) 
or  7(x| 6{z))  =  exp{— 9{z)(l  —  a:)}  (improper  form,  PH  model  with  cure).  In  both  models  rela¬ 
tive  effects  represented  by  regression  coefficients  entering  the  partial  likelihood  are  identical  and 
identifiable  if  predictor  0  is  coded  correctly. 


4.2  Quasi-EM  Algorithm  (QEM) 

This  section  reviews  a  universal  procedure  (QEM  algorithm)  designed  in  [Tsodikov,  2003]  to  fit 
NTM  models. 

Let  U,  i  =  1, . . . ,  n  be  a  set  of  times,  arranged  in  increasing  order,  tn+i  oo.  Associated  with 
each  U  is  a  set  of  subjects  Vt  with  covariates  zy,  j  £  T>i  who  fail  at  U,  and  a  similar  set  of  subjects 
Ci  with  covariates  zy,  j  £  Ci  who  are  censored  at  U.  The  observed  event  £y  for  the  subject  ij 
is  a  triple  (£*,  Zij,Cij),  where  c  is  a  censoring  indicator,  c  =  1  if  failure,  c  =  0  if  right  censored. 
For  any  function  A(t),  let  At  =  A(ti),  A A{  =  \A{U )  —  A{ U  —  0)|.  A  step-wise  function  H  can  be 
characterized  by  two  vectors  AH  =  (AH\, . . . ,  A Hn)T  and  t  =  (ti, . . . ,  tn)T.  With  this  notation, 
under  an  NT  model  and  noninformative  censoring,  the  likelihood  of  survival  data  takes  the  form 

e  =  ±Di\og[AHi}  +  ±  £  logtf(Fi|/3,Zy,Cy),  (9) 

»=i  i= i  jeCiUVi 

where 

i d(x  |  •,  c)  =  xc^c\x  |  •}, 

and  Di  is  the  number  of  failures  associated  with  U.  We  use  the  profile  likelihood  approach  to 
maximize  t.  The  profile  likelihood  is  defined  as  a  supremum  of  the  full  likelihood  taken  over  the 
nonparametric  part  of  the  model 

£pr(f3)  =  max£(/3,  H).  (10) 

H 

The  algorithm  follows  the  straightforward  nested  procedure: 

•  Maximize  ipr{P)  by  a  conventional  nonlinear  programming  method,  for  example,  the  Powell 
method  [Press  et  ah,  1994], 

•  For  any  {3  as  demanded  in  the  above  maximization  procedure,  solve  the  problem  (10). 


Inference  based  on  the  profile  likelihood  is  not  straightforward,  as  the  usual  theory  of  MLE  does 
not  apply  to  unlimited  dimension.  Important  results  have  been  obtained  regarding  theoretical 
justification  for  the  nonparametric  maximum  likelihood  estimation  (NPMLE)  method  and  the 
profile  likelihood  for  semiparametric  models  [Murphy,  2000,  van  der  Vaart,  1998,  Murphy  and 
van  der  Vaart,  1997].  It  was  shown  that  profile  likelihoods  with  nuisance  estimated  out  behave 
like  ordinary  likelihoods  under  some  conditions.  In  particular,  these  results  apply  to  the  PH  model, 
the  proportional  odds  (PO)  model  [Murphy,  2000,  Murphy  et  al.,  1997]  and  the  PH  frailty  model 
[Murphy,  1994,  1995],  and  presumably  to  most  other  models. 

The  following  method  (QEM)  is  used  to  obtain  the  profile  likelihood  and  solve  (10): 

Atf(fe+ 1)  = _ ^ _ 

m  Eye^©(^(fc)|/3y,^,^)’ 
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where  k  counts  iterations. 

It  can  be  shown  that  if  0  is  nondecreasing,  each  update  of  H  using  (11)  strictly  improves  the 
likelihood,  given  (3.  This  guarantees  convergence  of  the  sequence  of  likelihood  values  £  j/3, 
to  the  profile  likelihood  under  fairly  general  conditions. 

Under  a  PH  mixture  model,  the  procedure  (11)  is  an  EM  algorithm  based  on  imputation 
of  the  missing  predictor  U  in  the  Nelson-Aalen-Estimator  by  its  conditional  expectation,  given 
observed  data,  represented  by  Q(F\(3,z,c).  Under  an  NT  model,  the  procedure  works  as  a 
Quasi-EM  algorithm  without  the  missing-data  interpretation.  It  can  be  shown  [Tsodikov,  2003] 
that  imputation  in  the  QEM  procedure  is  accomplished  using  the  so-called  quasi-expectation 
operator,  QE,  which  generalizes  mathematical  expectation  operator  on  a  restricted  class  of  basis 
functions  in  a  way  that  its  linearity  and  second-order  differentiation  properties,  as  well  as  the 
Jensen  inequality  are  preserved. 

5  Model  Building 

5.1  Composition  for  PH  mixture  models 

The  idea  to  use  compounding  to  build  particular  extended  families  of  frailty  models  is  not  new. 
For  example,  Aalen  [1992]  used  a  compound  Poisson  distribution  to  extend  a  class  of  frailty  models 
by  Hougaard  [1984]. 

Consider  the  following  general  compounding  techniques  for  the  PH  mixture  model.  If  u  is  a 
nonnegative  discrete  random  variable  with  the  moment  generating  function  70(2)  =  E  {2"},  and  £*, 
are  i.i.d.  copies  of  another  nonnegative  random  variable  (independent  of  v)  with  the  the  moment 
generating  function  7^(2)  =  E  {2^},  and  U  is  a  compound  random  variable  given  by 

=  (12) 

*:=i 

then  by  the  composition  property  of  Laplace  transform, 

7(2)  =  £{2^}  =  (7<?o7„)(2).  (13) 

A  large  variety  of  semiparametric  mixture  models  can  be  derived  from  (13).  When  70(2)  corre¬ 
sponds  to  a  continuous  random  variable,  the  compound  variable  U  is  no  longer  of  the  simple  form 
(12).  However,  the  composition  7 0  07^  still  corresponds  to  a  PH  mixture  model  with  some  frailty 
random  variable  U,  as  given  by  the  following  proposition. 

Proposition  5.1  Composition  for  mixture  models. 

Let  7 e  and  7,;  be  some  two  mixture  models  70(2]-)  =  E(xu  |  •),  7^(2] ■)  =  E(x^  \  •),  where  v  and  £  are 
some  independent  nonnegative  random  variables.  Let  7  =  70  o  7^  be  the  compound  model.  Then 
7  is  also  a  mixture  model,  meaning  that  there  exists  a  nonnegative  random  variable  U  such  that 

7(xIO  =  r(xu  |  •). 

Proof.  By  the  Bernstein  theorem  (Feller  [1971]),  we  need  to  prove  that  j(e  5|*)  =  (7 q  5|-) 

is  a  completely  monotonic  function.  Let  'ip.(s)  =  7.(e_5|*).  We  have  ip(s)  ~  ipe  {—log ^7 ?($)}•  For 
any  functions  £  and  £,  the  composition  £  o  (  is  completely  monotonic  if  £  is  completely  monotonic, 
£  >  0,  and  £'  is  completely  monotonic.  Applied  to  the  functions  ?/>,  this  means  that  we  have  to 
prove  that  for  any  completely  monotonic  function  ^(s)  >  0,  the  function  f(s )  =  {—log^(s)}/  is 
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completely  monotonic.  It  can  be  proved  by  induction  that 

(-ir/'-'w-EM-i 

where  Uqi  —  1,  Gtn+1,1  ~  ^n+l,/c  =  ^nfc(^  k  2)  T  &n,k—l)  k  2,  ...  ,71  +  1,  Q,n+\jljr2  ^n,n+li 

n  =  0, 1,  •  •  •.  From  the  above  equations  it  follows  that  ank  >  0  for  any  n,  k.  Also,  ip(s)  >  0,  s  >  0, 
and  since  ip  is  completely  monotonic,  (— 1)V^(S)  ^  0-  Therefore,  (— l)n/-”)(s)  >  0,  s  >  0.  End 
of  proof. 

As  a  result  of  the  above  observations,  we  have  a  tool  to  build  hierarchical  regression  models 
using  composition  of  moment  generating  functions  7  =  70  o  7^.  The  fact  that  the  class  of  PH 
mixture  models  is  closed  with  respect  to  such  composition  allows  us  to  use  an  EM-approach  to  fit 
compound  mixture  models. 

Consider  a  model  composed  of  the  PH  and  the  PO  models.  Take  the  moment  generating 

function  /  n 

n  =  ??(•) 

rj(-)  -  log  x 

of  the  exponential  distribution  with  parameter  77(73,2 ),  corresponding  to  the  PO  model.  Take 
another  moment  generating  function 

7„(z|  •)  =  cc*0, 

corresponding  to  the  PH  model  with  predictor  9((3,  z ).  As  a  result  of  the  composition  7  =  70  07^, 
we  have  the  so-called  T-frailty  model 

G{t\e(-),r](-)}  =  {^()  +  •  (14) 

Indeed,  ip( s )  =  7(e~s|-)  =  [r](-)/{r]{-)  +  s}]0()  is  the  Laplace  transform  of  a  T-distribution  with 
scale  parameter  rj  and  shape  parameter  9,  and  we  have  the  interpretation  of  the  compound  model 
(14)  as  a  T-frailty  model. 

It  is  assumed  that  predictors  depend  on  /3,  z  via  the  form  /?0  +  0Tz,  where  /?o  stands  for  the 
intercept  term  of  the  predictor.  Also,  different  predictors  have  independent  sets  of  regression 
coefficients  9  =  9(/3w  +  (3jz),  77  =  t](02o  +  01  z).  To  avoid  overparameterization  of  the  T-frailty 
model,  the  intercept  in  9  is  fixed  at  zero  0u)  =  0.  With  the  above  conventions,  a  test  for  /32  =  0 
is  a  test  for  the  PO  assumption,  while  a  test  for  (32  —  0  is  a  test  for  the  PH  assumption.  Setting 
/3i  =  /32  =  0  corresponds  to  a  test  of  homogeneity. 


5.2  Composition  for  NTMs 

We  extended  the  composition  techniques  for  the  PH  mixture  model  (Section  5.1)  to  the  NTM 
class.  The  composition  technique  offers  a  simple  way  to  build  hierarchical  families  of  models  that 
combine  the  features  of  simpler  models.  Specifically,  if  70  and  7^  are  two  different  NT  models  with 
predictors  9,  and  77,  respectively,  then 

7(^10  =  (To®  7»?)(^|-)  (15) 

is  a  new  semiparametric  model  with  two  predictors  9  and  77.  If  7^  (a;  |  • )  =  x  for  some  value  of  9 
(usually  for  9=1),  then  the  model  (15)  includes  models  70  and  7^  as  nested  special  cases.  The 
fact  that  NTM-generating  functions  y(x  |  •)  are  all  defined  on  a:  €  [0, 1]  and  have  the  range  in  the 
same  interval  allows  us  to  compose  as  complex  a  hierarchical  model  as  needed.  In  particular,  a 
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composition  of  the  PH  model  in  the  improper  form 

9{z))  =  exp{— 0(z)(l  -  a;)}  (16) 

with  the  one  in  the  proper  form 

Tv  (*\v(z))  =  xv(z)  (17) 

results  in  the  so-called  PHPH  model  incorporating  long-  and  short-term  survival  effects 

G(t\z)  =  exp  \-9(z)  {1  -  [F(t)]7?(z)}] .  (18) 

We  have  shown  that  operation  of  composition  preserves  the  key  property  of  nondecreasing  0, 
and  the  EM-like  estimation  algorithms  of  Section  4.2  remain  applicable  within  the  hierarchical 
family. 


Proposition  5.2  Composition. 

Let  70  and  be  some  two  NTM-generating  functions,  each  satisfying  the  assumption  of  nonde¬ 
creasing  0,  where  0  is  given  by  (4),  and  let  7  =  70  07^  be  the  compound  function  (compositions 
are  taken  with  respect  to  x).  Let  Qa  be  the  Q-function  (4)  corresponding  to  7 a,  a  =  9,1],  and  to 
the  compound  function  7 ,  if  a  is  blank.  Then  (A) 

0(z|  -,c)  =  0„(*|.,O)  {(©0  07,,)  (x  |  c)  -  c}  +  cQv(x  |  •,  c),  (19) 

where  c  =  0, 1  and  (0  o  7)  {x  |  •,  c)  is  understood  as  ©{7(2:  |  -)|-,  c};  and 

(B)  The  function  ©  (4)  derived  from  the  compound  NTM-generating  function  7  is  nondecreasing 
in  x  as  required  for  monotonicity  and  convergence  of  the  estimation  algorithms  (See  Section  4-2). 
Proof.  Proof  of  first  statement  is  a  straightforward  exercise  in  differentiation  of  compound  func¬ 
tions  entering  (4).  Validity  of  second  statement  follows  from  (19)  upon  observation  that  all  com¬ 
ponents  of  (19)  are  nondecreasing  functions  in  x.  End  of  proof. 

Equation  (19)  simplifies  derivation  of  0  for  the  compound  models  through  direct  use  of  0s 
corresponding  to  submodels  participating  in  the  composition. 

Consider  another  example  of  building  hierarchical  models  using  composition.  Using  the  com¬ 
position  framework,  the  Dabrowska  and  Doksum  model  [Dabrowska  and  Doksum,  1988]  can  be 
represented  through  a  composition  7  =  7i/a  0  7o  o  7„,  where  70  is  an  NTM-generating  function 
for  the  PO  model,  ja  and  ji/a  correspond  to  the  PH  model,  and  a  is  a  scalar,  independent  of 
covariates  1 

oi0'  (20> 

The  above  model  becomes  the  PO  model  if  a  =  1,  and  it  becomes  the  PH  model  in  the  limit  as 
a  — »  0.  With  the  above  model,  the  PH  assumption  corresponds  to  the  border  of  the  parametric 
space  (a  =  0)  and,  for  this  reason,  we  prefer  to  use  the  E-frailty  model  (14)  in  this  paper  to 
illustrate  the  methodology. 

The  E-frailty  model  (14)  can  be  built  as  a  composition  of  the  NTM-generating  functions 

Te{x\ ')  = 


and 


Tv(x\  •)  = 


v{-) 


1 ?(•)  -  log  a: 

corresponding  to  the  PH  and  the  PO  models,  respectively,  without  having  to  work  with  their 
frailty  underpinnings.  Using  (4),  we  derive  the  ©-functions  corresponding  to  the  two  submodels: 


•  PH  model:  Qq{x  |  •,  c)  =  #(•), 
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•  PO  model:  Qv(x  |  -,c)  =  (c+  l){r/(-)  -  logo:} 


Next,  the  compound  0  is  derived  from  (19)  using  the  above  expressions  for  the  submodels: 

Q(x  I  •  c)  =  —  -  - 

11  )Cj  rj(-)  —  logs 


(21) 


6  Hypotheses  Testing 


It  should  be  noted  that  the  problem  of  deriving  a  variance  estimator  with  the  QEM  procedure  is 
quite  different  from  the  one  with  the  EM  algorithm.  The  problem  with  the  EM-based  information 
matrix  is  that  the  likelihood  is  unavailable  in  a  closed  form,  or  it  is  difficult  to  differentiate.  In  our 
situation,  however,  the  problem  is  formulated  without  use  of  missing  data,  and  the  likelihood  is  a 
quite  simple,  easily  differentiable  function.  So,  it  is  not  a  problem  to  write  down  the  full  model 
information  matrix.  The  problem  is  that  the  number  of  parameters  of  a  semiparametric  model  is 
potentially  unlimited.  For  this  reason,  obtaining  the  inverse  of  the  full  information  matrix  can  be 
computationally  prohibitive. 

We  have  developed  a  numerically  efficient  procedure  to  overcome  the  problem  using  the  profile 
information  matrix 


TP 

JPP 


a2v(/3} 

d(3d[3T  ’ 


(22) 


where  iw  is  the  profile  likelihood  (10)  obtained  by  searching  for  the  fixed  point  AH* ((3)  of  (11). 
Implicit  differentiation  of  the  profile  likelihood  yields 


TPP  -  IP,P  + 


(dAH* 

\W 


IaH,aH 


dAH* 

~W 


where 


(dAH* 
+  (  dp 

d2£ 


IaH,/3  +  JAiT,/3 


dAH* 

dP 


(23) 


'a, 6 


dadbT 

for  any  two  vectors  a  and  b.  To  get  a  variance  estimator  for  regression  coefficients  P,  only  a  small 
(dim/3  x  dim/3)  profile  matrix  needs  to  be  inverted.  The  downside  of  (23)  is  that  the  Jacobian 
matrix  dAH* /dp  is  generally  unavailable  in  a  closed  form.  The  success  of  the  above  approach  is 
determined  by  the  existence  of  an  efficient  numerical  method  to  compute  dAH* /dp.  Generally, 
computation  of  dAH* /dp  is  as  difficult  as  taking  the  inverse  of  the  original  full  model  information 
matrix  (0(n3)  operations  required),  and  this  derivation  defeats  the  purpose. 

However,  if  the  functional  D(H,  1 1-)  depends  on  (II,  t )  only  through  //(/.),  which  corresponds  to 
the  NT  models  (Section  4.1.2),  the  system  of  linear  equations  for  dAH* /dp  acquires  a  triangular 
form,  and  it  can  be  solved  recurrently.  Indeed,  with  an  NT  model  we  have 

elk  =  &k~  Js, e(FI/3,  z,i’  Cy)'  (24) 

where  0  is  given  by  (4).  Differentiating  the  above  score  equation  with  respect  to  A H„ 
the  elements  of  the  IAfj  information  matrix 

dH  Dk 


we  get 


dAHkdAHm  A  H2 


3 km  £  Q(Fi\p,  Zij) 


,m) 


Page  12 


Report 


CHANGE  OF  GRANTEE  INSTITUTION  PI:  Tsodikov,  Alexander 


where  6km  =  1,  if  k  =  m  and  Skm  —  0  otherwise,  and 

00(4) 


Q{x\-,c)  = 


=  {0(4,  c)  -  c}  x  {0(4,  c)  —  0(4,  c  +  1)}  . 


(25) 


d  log(x) 

Consider  the  self  consistency  equation  (11)  at  the  fixed  point  A H*(0).  Implicit  differentiation  of 
the  likelihood  after  a  little  algebra  gives  the  following  system  of  equations 


dHj-1  _  r  (m 
dp  k  v  op 


A  A  D  dH* 

+  '4l  +  gB‘ai3 


(26) 


where 


4  v-  dQ{F*\P,zihCij) 

Ak=  2^  -QQ  > 

ij€Kk  up 

Bi=  E  Q{F;\p,zihCij), 

jeCjUO, 


and 


(Ag;)2 

Dt 


The  system  of  equations  (26)  is  linear  in  dHl/dp  with  an  upper  triangular  matrix.  Such  systems 
can  be  solved  recurrently.  On  substitution  of  the  above  solutions  into  (23),  the  profile  information 
matrix  for  an  NT  model  is  obtained. 

We  are  working  on  software  implementation  and  numerical  experiments  with  the  above  method. 

Also,  we  have  implemented  a  pragmatic  numerical  approach  similar  to  the  one  proposed  in 
reference  [Nielsen  et  al.,  1992],  In  the  course  of  maximization  of  the  profile  likelihood  with  respect 
to  regression  coefficients  /3,  a  dense  sample  of  the  profile  likelihood  surface  is  generated  in  the 
vicinity  of  a  stationary  point.  The  curvature  of  the  profile  likelihood  surface  at  the  stationary 
point  can  be  estimated  by  fitting  a  quadratic  function  to  some  domain  around  the  point,  using 
least  squares.  For  example,  the  domain  can  be  limited  to  points  that  cannot  be  rejected  using 
the  likelihood  ratio  test  (applied  informally).  Our  numerical  experiments  have  shown  that  this 
method  is  unstable  for  models  with  more  than  one  or  two  covariates.  Also,  it  is  much  less  efficient 
computationally  compared  to  the  one  based  on  implicit  differentiation  of  the  profile  likelihood. 


7  Preliminary  Data  Analysis 

As  an  example,  we  used  data  from  the  National  Cancer  Institutes  Surveillance  Epidemiology 
and  End  Results  (SEER)  program.  Using  the  publicly  available  SEER  database,  11621  cases  of 
primary  prostate  cancer  diagnosed  in  the  state  of  Utah  between  1988  and  1999  were  identified.  The 
following  selection  criteria  were  applied  to  a  total  of  19819  Utah-cases  registered  in  the  database: 
valid  positive  survival  time,  valid  stage  of  the  disease,  age>  18  years.  Prostate  cancer  specific 
survival  was  analyzed  by  stage  of  the  disease  (localized/regional  vs.  distant).  For  the  definition 
of  stages  as  well  as  for  other  details  of  the  data  we  refer  the  reader  to  SEER  documentation 
http://seer.cancer.gov/. 

Three  models  PH,  PO,  and  T-frailty  model  with  z  representing  two  groups  corresponding  to 
localized/regional  stage  (10765  cases)  and  distant  stage  (856  cases),  respectively,  were  fitted  using 
the  profile  QEM  algorithm.  Estimates  of  model  parameters  are  given  in  Table  1. 

Observed  (Kaplan-Meier)  and  expected  model-based  estimates  of  the  survival  functions  by 
group  as  well  as  diagnostic  plots  are  shown  in  Figure  1. 
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Figure  1:  Observed  vs.  expected  plots  corresponding  to  the  PH  and  the  PO  model  fitted  to  the 
prostate  cancer  data.  Thin  and  thick  lines  correspond  to  observed  and  expected  plots,  respectively. 

It  is  evident  from  the  comparison  of  observed  and  expected  survival  functions  that  the  PO 
model  provides  better  fit  to  the  data  than  the  PH  model.  The  T-frailty  model  provides  the 
fit  (not  shown)  which  is  very  similar  to  the  PO  model.  We  test  the  model  assumptions  using 
the  hierarchical  structure  of  the  T-frailty  model.  Using  the  likelihood  ratio  test  with  the  profile 
likelihood,  the  PO  assumption  could  not  be  rejected  (p=0.712). 

8  Key  Research  Accomplishments 

Simmarizing,  the  key  research  accomplishments  of  the  first  8  months  period  of  the  project  are: 

1.  Development  of  the  class  of  Nonlinear  Transformation  Models  (NTM)  and  associated  QEM 
estimation  procedures  and  their  computer  implementation; 

2.  Development  of  composition  technique  as  a  tool  for  model  building.  We  established  that  the 
class  of  PH  mixture  models  and  NTM  is  closed  with  respect  to  compositions  of  so-called  model 
generating  functions.  Moreover,  we  established  that  QEM  estimation  procedures  are  applicable 
to  any  model  built  using  this  technique; 

3.  A  numerically  efficient  algorithm  has  been  developed  for  estimation  of  the  inverse  of  profile 
information  matrix  for  NTM  that  avoids  taking  inverse  of  the  full  information  matrix  of  the 
semiparametric  model.  This  algorithm  will  be  used  for  testing  hypotheses,  variable  selection 
and  estimation  of  confidence  intervals. 
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Model 

Parameter 

Point- 

estimate 

Confidence 

interval 

p- Value 

Test 

PH 

PpK 

2.734 

<0.0001 

Homogeneity 

PO 

Ppo 

-3.251 

(-3.416,-3.086) 

<0.0001 

Homogeneity 

T-frailty 

PPE 

Ppo 

0.071 

-3.149 

(-0.310,0.452) 

(-3.714,-2.583) 

0.712 

<0.0001 

<0.0001 

PpR  =  0,  PO 
(3fo  =  0,  PH 
Homogeneity 

Table  1:  Parameter  estimation  and  hypothesis  testing  for  prostate  cancer  data  based  on  PH,  PO, 
and  P-frailty  models.  Regression  coefficients  /?pjj  and  /?po  measure  the  disadvantage  of  being  in 
the  distant  stage  relative  to  local/regional  stage  as  represented  by  log  odds  ratio  and  log  hazards 
ratio,  respectively.  Negative  fipQ  and  positive  /?pjj  means  worse  survival. 

9  Reportable  Outcomes 

9 . 1  Manuscripts 

1.  Tsodikov,  A.  (2003)  Semiparametric  models:  A  generalized  self-consistency  approach,  Journal 
of  the  Royal  Statistical  Society,  Series  B,  Vol.  65,  759-774. 

2.  Tsodikov,  A.,  Ibrahim,  J.G.,  and  Yakovlev,  A.Y.  (2003)  Estimating  Cure  Rates  from  Survival 
Data:  An  Alternative  to  Two-Component  Mixture  Models,  Journal  of  the  American  Statistical 
Association,  (review  paper)  to  appear  in  December  2003. 

3.  Boucher,  K.,  Asselain,  B.,  Tsodikov,  A.,  Yakovlev,  A.  Semiparametric  versus  parametric  regres¬ 
sion  analysis  based  on  the  Bounded  Cumulative  Hazard  Model:  An  application  to  breast  cancer 
recurrence,  In  Semiparametric  Models  in  Survival  Analysis,  Quality  of  Life  and  Reliability, 
Birkhauser  (invited  paper),  to  appear. 

9.2  Presentations 

Tsodikov,  A.  (2003)  Generalized  Self-Consistency  Methods  for  Cure  Models,  Joint  Statistical 
Meetings,  Invited  session  on  Cure  Models,  (invited),  San  Francisco,  August  2003. 

10  Conclusions 

Most  semiparametric  survival  models  can  be  induced  by  frailties.  Compounding  the  distribution 
of  frailty  offers  a  way  to  build  hierarchical  families  of  semiparametric  models  that  can  be  used  to 
test  model  assumptions  and  to  reproduce  complex  patterns  of  covariate  effects  using  more  than 
one  predictor.  A  PH  mixture  model  represents  survival  function  G  as  a  composition  G  =  7  o  F, 
where  7  is  a  moment  generating  function,  and  F  is  a  nonparametrically  specified  baseline  survival 
function.  We  note  that  hierarchical  PH  mixture  models  can  be  built  using  compositions  of  the  form 
G  —  71  o. .  ,o7mo F,  where  7,  are  moment  generating  functions  for  submodels.  Mixture  models  can 
be  fit  by  an  EM  algorithm  which  is  specified  as  repeated  imputation  of  the  missing  frailty  variable 
using  its  conditional  expectation,  given  observed  event.  We  find  that  this  conditional  expectation 
is  represented  as  a  composition  ©  o  F,  where  0  is  defined  through  first  two  derivatives  of  7.  The 
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above  framework  is  naturally  generalized  as  the  existence  of  frailty  interpretation  is  required  neither 
for  model  building  nor  for  model  fitting.  For  the  class  of  Nonlinear  Transformation  Models,  we 
formulate  composition  rules  for  7  and  0  and  develop  the  QEM  generalization  of  the  EM  algorithm. 

During  the  first  phase  of  the  project  we  developed  a  basic  toolbox  of  analytic  and  algorithmic 
tools  for  model  building,  model  fitting  and  hypothesis  testing.  Fragments  of  software  implemen¬ 
tation  were  developed  and  used  to  demonstrate  the  utility  and  superior  numerical  performance  of 
these  methods  in  real  and  simulated  data. 

During  the  second  phase  of  the  project  we  are  planning  to  continue  to  build  on  the  results 
obtained  so  far  extending  the  arsenal  of  methods  to  include  variable  selection  and  computer¬ 
intensive  methods.  We  will  continue  software  implementation  of  these  procedures  under  a  common 
shell,  verification  of  the  methods  by  simulations  and  their  application  to  real  prostate  cancer  data. 


11  References 
References 

O. O.  Aalen.  Modelling  heterogeneity  in  survival  analysis  by  the  compound  Poisson  distribution. 
Annals  of  Applied  Probability,  2:951-972,  1992. 

S.R.  Cheng,  L.J.  Wei,  and  Z  Ying.  Analysis  of  transformation  models  with  censored  data. 
Biometrika ,  82:835-845,  1995. 

S.R.  Cheng,  L.J.  Wei,  and  Z  Ying.  Predicting  survival  probabilities  with  semiparametric  trans¬ 
formation  models.  Journal  of  the  American  Statistical  Association ,  92(437):227-235,  1997. 

D.  Clayton  and  J.  Cuzick.  Multivariate  generalizations  of  the  proportional  hazards  model.  Journal 
of  the  Royal  Statistical  Society,  Series  A,  148:82-117,  1985. 

D.M.  Dabrowska  and  K.A.  Doksum.  Estimation  and  testing  in  a  two-sample  generalized  odds-rate 
model.  Journal  of  the  Americal  Statistical  Association,  83:744-749,  1988. 

K.H.  Doksum  and  M.  Gasko.  On  a  correspondence  between  models  in  binary  regression  analysis 
and  in  survival  analysis.  International  Statistical  Review,  58:243-252,  1990. 

W.  Feller.  An  Introduction  to  Probability  Theory  and  its  Applications.  John  Wiley  and  Sons,  New 
York,  1971. 

P.  Hougaard.  Life  table  methods  for  heterogeneous  populations:  Distributions  describing  the 
heterogeneity.  Biometrika,  71(1):75— 83,  1984. 

J.P.  Klein.  Semiparametric  estimation  of  random  effects  using  the  Cox  model  based  on  the  EM 
algorithm.  Biometrics,  48:795-806,  1992. 

S.A  Murphy,  A.J.  Rossini,  and  A.W.  van  der  Vaart.  Maximum  likelihood  estimation  in  the 
proportional  odds  model.  Journal  of  the  American  Statistical  Association,  92(439):968— 976, 
1997. 


Page  16 


Report 


CHANGE  OF  GRANTEE  INSTITUTION  PI:  Tsodikov,  Alexander 


S.A.  Murphy  and  A.W.  van  der  Vaart.  Semiparametric  likelihood  ratio  inference.  The  Annals  of 
Statistics,  25:1471-1509,  1997. 

S.A.  Murphy.  Consistency  in  a  proportional  hazards  model  incorporating  a  random  effect.  The 
Annals  of  Statistics,  22(2):712— 731,  1994. 

S.A.  Murphy.  Asymptotic  theory  for  the  frailty  model.  The  Annals  of  Statistics,  23(1):182— 198, 
1995. 

S.A.  Murphy.  On  profile  likelihood.  Journal  of  the  American  Statistical  Association,  95:449-485, 

2000. 

G.G.  Nielsen,  R.D.  Gill,  P.K.  Andersen,  and  T.I.  Sorensen.  A  counting  process  approach  to 
maximum  likelihood  estimation  in  frailty  models.  Scandinavian  Journal  of  Statistics,  19:25-43, 
1992. 

W.H.  Press,  B.P.  Flannery,  S.A.  Teukolsky,  and  W.T.  Vetterling.  Numerical  Recipies  in  Pascal. 
The  Art  of  Scientific  Computing.  Cambridge  University  Press,  New  York,  NY,  1994. 

A.  Tsodikov.  Semiparametric  models:  a  generalized  self-consistency  approach  (in  press).  Journal 
of  the  Royal  Statistical  Society,  Series  B,  2003. 

A.W.  van  der  Vaart.  Asymptotic  statistics.  Cambridge  Series  in  Statistical  and  Probabilistic 
Mathematics.  Cambridge  University  Press,  Cambridge,  UK,  1998. 

J.W.  Vaupel,  K.G.  Manton,  and  E.  Stallard.  The  impact  of  heterogeneity  in  individual  frailty  on 
the  dynamics  of  mortality.  Demography,  16:439-454,  1979. 

J.T.  Wassel  and  M.L.  Moeschberger.  A  bivariate  survival  model  with  modified  gamma  frailty  for 
assessing  the  impact  of  interventions.  Statistics  in  Medicine,  12:241-248,  1993. 


12  Appendix 

List  of  papers  presented  in  the  appendix 

1.  Tsodikov,  A.  (2003)  Semiparametric  models:  A  generalized  self-consistency  approach,  Journal 
of  the  Royal  Statistical  Society,  Series  B,  Vol.  65,  759-774. 

2.  Tsodikov,  A.,  Ibrahim,  J.G.,  and  Yakovlev,  A.Y.  (2003)  Estimating  Cure  Rates  from  Survival 
Data:  An  Alternative  to  Two-Component  Mixture  Models,  Journal  of  the  American  Statistical 
Association,  (review  paper)  to  appear  in  December  2003. 

3.  Boucher,  K.,  Asselain,  B.,  Tsodikov,  A.,  Yakovlev,  A.  Semiparametric  versus  parametric  regres¬ 
sion  analysis  based  on  the  Bounded  Cumulative  Hazard  Model:  An  application  to  breast  cancer 
recurrence,  In  Semiparametric  Models  in  Survival  Analysis,  Quality  of  Life  and  Reliability, 
Birkhauser  (invited  paper),  to  appear. 


Page  17 


Appendix 


J.  Ft.  Statist.  Soc.  B  (2003) 
65,  Part  3,  pp.  759-774 


Semiparametric  models:  a  generalized 
self-consistency  approach 


A.Tsodikov 

University  of  Utah,  Salt  Lake  City,  USA 
[Received  December  2002.  Revised  February  2003] 

Summary.  In  semiparametric  models,  the  dimension  d  of  the  maximum  likelihood  problem  is 
potentially  unlimited.  Conventional  estimation  methods  generally  behave  like  0(d 3).  A  new  0(d) 
estimation  procedure  is  proposed  for  a  large  class  of  semiparametric  models.  Potentially  unlim¬ 
ited  dimension  is  handled  in  a  numerically  efficient  way  through  a  Nelson-Aalen-like  estimator. 
Discussion  of  the  new  method  is  put  in  the  context  of  recently  developed  minorization-maxi- 
mization  algorithms  based  on  surrogate  objective  functions.  The  procedure  for  semiparametric 
models  is  used  to  demonstrate  three  methods  to  construct  a  surrogate  objective  function:  using 
the  difference  of  two  concave  functions,  the  EM  way  and  the  new  quasi-EM  (QEM)  approach. 

The  QEM  approach  is  based  on  a  generalization  of  the  EM-like  construction  of  the  surrogate 
objective  function  so  it  does  not  depend  on  the  missing  data  representation  of  the  model.  Like 
the  EM  algorithm,  the  QEM  method  has  a  dual  interpretation,  a  result  of  merging  the  idea  of 
surrogate  maximization  with  the  idea  of  imputation  and  self-consistency.  The  new  approach  is 
compared  with  other  possible  approaches  by  using  simulations  and  analysis  of  real  data.  The 
proportional  odds  model  is  used  as  an  example  throughout  the  paper. 

Keywords :  EM  algorithm;  Frailty;  Nonparametric  maximum  likelihood  estimation;  Profile 
likelihood;  Semiparametric  models 

1.  Introduction 

Potentially  unlimited  dimension  has  been  the  most  critical  deterrent  to  the  use  of  maximum 
likelihood  estimation  (MLE)  in  semiparametric  regression  models.  In  survival  analysis,  meth¬ 
ods  based  on  the  partial  likelihood  (Cox,  1972)  are  specific  to  the  proportional  hazards  (PH) 
model  and  do  not  extend  to  other  models.  Straightforward  Newton-type  methods  of  maximiz¬ 
ing  the  likelihood  for  the  full  model  generally  require  0(d 3)  operations  to  solve  the  system  of 
score  equations,  where  d  is  the  number  of  model  parameters.  The  principal  part  of  the  set  of 
d  parameters  in  a  semiparametric  model  is  used  to  specify  a  stepwise  function  H  which  ap¬ 
proaches  a  continuous  ‘true*  H  in  probability,  as  d  oo.  Although  theoretically  almost  any 
likelihood  can  be  maximized  by  a  Newton-type  method,  its  high  complexity  makes  the  prob¬ 
lem  computationally  difficult  for  large  d .  The  development  of  general,  stable  and  numerically 
efficient  algorithms  for  semiparametric  MLE  has  been  a  long-standing  problem  (Fleming  and 
Lin,  2000).  Such  algorithms  are  the  subject  of  this  paper.  The  argument  goes  as  follows.  The 
bottle-neck  of  a  maximization  algorithm  for  a  semiparametric  likelihood  is  the  estimation  of 
H.  Let  /  be  the  log-likelihood  of  a  semiparametric  model,  treated  as  a  functional  of  H.  Consider 
a  class  of  continuous  semiparametric  models  with  the  log-likelihood  of  the  form  (informally) 

l  =  E  A  log{dff(0}  +  E  log{  W  t\z)}9  (1) 

t  t 
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where  Dt  is  the  number  of  exact  observations  at  t  (failures),  z  is  a  vector  of  covariates  and 
'd  >  0  is  some  functional  of  H.  The  basic  assumption  that  contributes  to  equation  (1)  is  that  the 
probability  of  failure  in  [r,  t  +  dt]  is  proportional  to  d H(t),  which  is  differentiability.  To  obtain 
an  estimator  for  H ,  we  differentiate  l  with  respect  to  the  set  of  {d H(r)}.  Informally,  we  arrive 
at  the  so-called  self-consistency  equation 

d  H(t)  =  (2) 

where  ©  is  a  functional  representing  a  negative  ‘derivative’  of  log(^).  Since  both  sides  of  equation 
(2)  depend  on  H,  an  iterative  procedure  is  required  to  make  the  equation  self-consistent, 

d  tf(*+1>(r)  =  Dr/£©(//aVk),  (3) 

where  k  counts  iterations.  Iterative  updating  of  H  by  using  equation  (3)  is  the  basic  idea  behind 
the  algorithm.  As  we  shall  see,  the  above  procedure  is  intimately  linked  to  the  EM  algorithm 
as  used  to  fit  certain  PH  frailty  models  in  survival  analysis  (Oakes,  1989;  Klein,  1992;  Nielsen 
et  al ,  1992).  The  EM  algorithm  handles  H  in  an  0(d)  way  through  the  use  of  the  Nelson-Aalen- 
Breslow  estimator  (Andersen  et  al .,  1993)  for  the  cumulative  hazard  H.  This  is  made  possible 
as  the  M-step  reduces  to  the  PH  model.  However,  a  large  amount  of  analytic  work  would  be 
required  to  specify  an  estimation  procedure  for  a  new  non-PH  model.  Expectation  at  the  E-step 
may  prove  to  be  inaccessible  in  a  closed  form,  and  Monte  Carlo  extensions  of  the  EM  approach 
are  much  less  computationally  attractive.  Recently,  an  optimization  transfer  approach  (Lange 
et  al ,  2000)  was  proposed  that  allows  us  to  construct  EM-like  procedures  without  the  use  of 
missing  data.  For  a  target  function  /(x),x  e  IR'\  the  minorization-maximization  (MM)  algo¬ 
rithm  (Lange  et  a/.,  2000)  proceeds  by  construction  of  the  so-called  surrogate  objective  function 
0(x|y)  such  that  £(y|y)  =/(y),  and  <2(x|y)  </(y),  for  any  x,  to  ensure  monotonicity  of  the 
procedure.  Maximization  of  the  target  function  /  proceeds  iteratively  as 

x(*+D  =  argmax{<2(x|x^)}.  (4) 

The  MM  algorithm  converges  in  /  and  in  x  under  fairly  general  conditions  (Lange  et  al ,  2000). 
In  the  likelihood  interpretation,  the  EM  algorithm  is  a  particular  case  of  the  MM  algorithm. 
Unfortunately,  there  is  no  automatic  way  to  construct  Q.  The  procedure  (3)  interpreted  as  an 
MM  algorithm  is  used  to  highlight  three  methods  to  construct  a  surrogate  objective  function: 
using  the  difference  of  two  concave  functions,  the  EM  way  and  a  new  quasi-EM  (QEM)  ap¬ 
proach.  These  methods  link  the  EM  algorithm  for  frailty  models  and  its  modifications  with 
the  MM  algorithms.  In  the  QEM  approach,  ‘E’  in  the  EM  is  replaced  by  the  quasi-expectation 
operator  QE,  which  is  not  based  on  the  concept  of  a  random  variable.  The  result  is  the  so- 
called  QEM  algorithm,  which  presents  us  with  a  recipe  of  generalizing  an  EM  procedure  into  a 
‘distribution-free’  one,  representing  a  particular  MM  algorithm. 

2.  Profile  likelihood  approach 

The  problem  of  nonparametric  maximum  likelihood  estimation  (NPMLE)  with  the  semipara- 
metric  model  is  to  find  estimates  of  regression  coefficients  /3  and  an  NPMLE  estimate  of  H 
such  that  they  deliver  the  maximum  of  a  suitably  defined  likelihood  function  /  =  /(/3,  H).  In  this 
paper  we  use  a  profile  likelihood  approach  to  maximize  /.  The  profile  likelihood  is  defined  as  a 
supremum  of  the  full  likelihood  taken  over  the  nonparametric  part  of  the  model 

/pr(/3)  =  max{/(/3,  H)}. 

H 


(5) 
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Assuming  that  we  can  find  the  global  maximum  of  /  with  respect  to  H,  given  /3,  we  may  write 
the  profile  likelihood  as  an  implicit  function  of  (3 

/pr(/3)=/{/3,//(/3)},  (6) 

where  H{(3)  is  the  solution  of  equation  (5).  Our  algorithms  will  be  designed  following  a  straight¬ 
forward  nested  procedure: 

(a)  maximize  /pr(/3)  by  a  conventional  non-linear  programming  method  (e.g.  a  directions  set 
method); 

(b)  for  any  f3  as  demanded  in  the  above  maximization  procedure,  solve  problem  (5). 

As  the  number  of  parameters  of  a  semiparametric  model  is  potentially  unlimited,  obtaining 
the  inverse  of  the  full  information  matrix  can  be  computationally  prohibitive.  Therefore,  we  use 
the  profile  information  matrix 

jP  __  d%r{/3/pr(/3)} 

to  derive  a  standard  error  estimator  for  (3.  In  this  paper  we  adopt  a  pragmatic  numerical  ap¬ 
proach.  In  the  course  of  maximization  of  the  profile  likelihood  with  respect  to  / 3 ,  a  dense  sample 
of  the  profile  likelihood  surface  is  generated  near  a  stationary  point.  The  curvature  of  the  profile 
likelihood  surface  at  the  stationary  point  can  be  estimated  by  fitting  a  quadratic  function  to 
some  domain  around  the  point  by  using  least  squares.  For  example,  the  domain  can  be  lim¬ 
ited  to  points  that  cannot  be  rejected  by  using  the  likelihood  ratio  test  (applied  informally). 
Alternatively,  a  more  sophisticated  approach  can  be  used  based  on  implicit  differentiation 
of/pr- 

The  rest  of  the  paper  will  be  devoted  to  constructing  efficient  NPMLE  methods  for  obtaining 
/pr,  i.e.  for  maximizing  l  with  respect  to  H,  given  {3 ,  as  this  is  the  crux  of  the  matter. 

Practically,  inference  based  on  the  profile  likelihood  is  similar  to  that  based  on  the  partial 
likelihood  for  the  PH  model,  which  is  quite  convenient.  Theoretically,  however,  inference  based 
on  the  profile  likelihood  is  not  straightforward,  as  the  usual  theory  of  MLE  does  not  apply  to 
unlimited  dimension.  Important  results  have  been  obtained  regarding  a  theoretical  justification 
for  the  NPMLE  method  and  the  profile  likelihood  for  semiparametric  models  (Murphy,  2000; 
van  der  Vaart,  1998;  Murphy  and  van  der  Vaart,  1997).  It  was  shown  that  profile  likelihoods 
with  nuisance  parameters  estimated  out  behave  like  ordinary  likelihoods  under  some  conditions. 
In  particular,  these  results  apply  to  the  PH  model,  the  proportional  odds  (PO)  model  (Murphy, 
2000;  Murphy  et  a /.,  1997)  and  the  PH  frailty  model  (Murphy,  1994,  1995),  and  presumably  to 
most  other  models. 

Let  tj,  i  =  1, . . . ,  n,  be  a  set  of  times,  arranged  in  increasing  order,  and  define  f„+i  :=  co.  As¬ 
sociated  with  each  tj  is  a  set  of  individuals  £>/  with  time-independent  covariates  z / j ,  j  e  V\,  who 
fail  at  tj,  and  a  similar  set  of  individuals  C\  with  covariates  z,y,  j  e  Cj,  who  are  censored  at  tj. 
The  observed  event  Exj  for  the  subject  i  j  is  a  triple  (f/,z/;-,  c/;),  where  c  is  a  censoring  indicator: 
c  =  l  if  failure;  c  =  0  if  right  censored.  For  any  function  A(t ),  let  A/  =  A(f/),  AAj  —  \A(tj )  — 
A(tj  -  0)|.  A  stepwise  function  H  can  be  characterized  by  two  vectors  AH  =  (AH\ , . . . ,  AHn)T 
and  t  =  (*i, . . . ,  tn)T.  With  this  notation,  the  likelihood  of  survival  data  under  non-informative 
censoring  takes  the  form 

/  =  £  Djlog(AHj)  +  £  £  \og{'d{A\\,tj\(3,zij,Cij)}, 

/=!  i=l  jeCjUUj 


(8) 
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where  D,  is  the  number  of  failures  that  are  associated  with  tu  and  the  function  will  be  specified 
later  for  the  class  of  non-linear  transformation  models  (NTMs). 

3.  EM  algorithm  for  a  semiparametric  model 

For  example,  consider  a  PO  model  for  the  survival  function  G,  given  covariates  z, 

(9) 

where  9  is  a  predictor  and  H  is  some  nonparametrically  specified  base-line  cumulative  hazard. 
The  model  is  named  after  the  PO  property  that  for  any  two  values  of  the  predictor,  6\  and  62 , 
with  corresponding  survival  functions  Gift)  =  G(/|0/),  1  =  1,2,  the  odds  ratio 

odds{Gi(r)}  _  9\ 
odds{G2(0}  O2 

is  a  constant  in  f,  where  odds(a)  =  a/(  1  -  a). 

This  paper  was  inspired  by  the  idea  of  representing  a  semiparametric  model  as  a  mixture 
(frailty)  model,  and  to  use  the  EM  algorithm  to  fit  it.  With  this  idea  in  mind,  consider  a  PH 
mixture  model 

G(t\f3,z)  =  £{F(0WAz) \z},  (10) 

where  F=  exp (-//)  is  the  base-line  survival  function  corresponding  to  //,  and  U  =  U(/3,z)  is 
used  to  indicate  that  the  distribution  of  random  variable  U  depends  on  covariates  and  regression 
coefficients.  This  model  can  be  considered  a  compact  expression  for  a  family  of  so-called  PH 
frailty  models,  or  PH  models  with  random  effects  considered  by  Hougaard  (1984),  Klein  (1992), 
Nielsen  et  al  (1992),  Wassel  and  Moeschberger  (1993),  Clayton  and  Cuzick  (1985)  and  many 
others,  for  different  distributions  of  U,  possibly  dependent  on  covariates. 

To  construct  the  EM  algorithm  for  a  particular  model  (PO  in  the  example),  we  represent  it  as 
a  PH  mixture  model  (inverse  transform),  and  then  follow  the  usual  logic  of  the  EM  algorithm 
construction  for  frailty  models,  as  for  example  in  Nielsen  et  al  (1992). 

3. 1.  Inverse  transform 

We  note  that  £(j|-)  —  £[exp{-,y  £/(•)}]  is  the  Laplace  transform  of  £/(•),  and  that  for  the  PH 
mixture  model  (10) 

GOD  =  C{H(t)\ •}  =  £[-log{F(0}|-]- 

From  the  latter  equation  and  equation  (9),  we  conclude  that  U  for  the  PO  model  represents 
exponential  regression,  as  C(s |-)  =  6(-)/{6( •)  +  s}  is  the  Laplace  transform  of  an  exponential 
distribution  with  mean  9~] . 

3.2.  Complete-data  likelihood 

With  the  PH  mixture  model  (10),  pretend  that  U  is  known  for  each  subject  i 7,  continuing  the 
notation  of  Section  2.  The  complete-data  likelihood  under  non-informative  right  censoring 
corresponds  to  the  PH  model  with  predictors  Uij 

/cd  =  f:  {  A  log(A///)  -  £  UijHi 

i=\  l  jeCjUVj 


(11) 
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3.3.  E-step 

Since  the  complete-data  likelihood  (1 1)  is  linear  in  missing  data  Ujj,  the  E-step  reduces  to  impu¬ 
tation  of  each  U  by  the  corresponding  0,  the  conditional  expectation  of  U,  given  the  observed 
event.  Using  the  exponential  distribution  of  U  with  mean  9~\  after  a  little  algebra,  we  obtain 


J  u  F“(uh)c9  exp (—9u)  d u 
J  F“(uh)c9e\p(—9u )  d« 


r(c  +  2)9/ (9  +  H)c+1  _  c  +  1 
r(c  +  l)0/(0  +  W)f+'  “  9(13,  z)  +  H(t)  ’ 


(12) 


where  h  is  the  hazard  function  corresponding  to  H.  A  similar  derivation  of  0  for  a  gamma 
frailty  model  can  be  found,  for  example,  in  Parner  (1998). 


3.4.  M-step 

Maximization  of  the  complete-data  likelihood  (1 1)  with  respect  to  H,  and  with  Ujj  imputed  by 
Ujj,  results  in  the  Nelson-Aalen  estimator 

A U,„  =  Dn,  j  X/  Uij,  m  —  \,...,n, 

/  ijtzTZ'm 

where  7 Zm  =  {i  j  :  j  e  T>,  U  Cj,i  ^  m}  is  the  set  of  subjects  at  risk  just  before  tm. 


3.5.  EM  procedure  for  the  proportional  odds  model 

Finally,  for  the  PO  model  we  have  the  iterative  EM  procedure 


A  //,<f+1)  =  Dm 


Cjj  -b  1 


ijeKm  8({3,Zjj)  +  H- 


(k) 


where  k  counts  iterations. 


m  = 


(13) 


3.6.  Alternative  derivation  of  procedure  (13) 

It  is  intriguing  that  we  can  formally  derive  procedure  (13)  as  an  immediate  corollary  of  the 
argument  presented  in  Section  1 .  Indeed,  using  equation  (9),  we  write  the  likelihood  for  the  PO 
model  as 


l  =  EDi  log(A Hi)  +  £  log 

/=1  jeCiUVj 


0Q39Xij) 


W,z,7)  +  Hi}cu+X 


(14) 


On  differentiating  equation  (14)  with  respect  to  A  Hm,  and  assigning  the  iteration  index  k  as  in 
equations  (l)-(3),  we  obtain  expression  (13). 

This  observation  deserves  discussion.  The  EM  derivation  presented  above  for  the  PO  model 
is  model  specific  and  its  feasibility  depends  on  the  success  and  simplicity  of  the  inverse  Laplace 
transform  and  the  integrals  that  are  evaluated  at  the  E-step  (12).  The  PH  mixture  representation 
of  a  semiparametric  model  may  not  exist,  in  which  case  the  EM  derivation  ultimately  fails.  Nec¬ 
essary  and  sufficient  conditions  for  this  representation  to  exist  are  a  corollary  of  the  Bernstein 
theorem  (Feller,  1971):  the  survival  function  must  be  a  completely  monotonic  function  of  H . 
A  function  ip(H)  is  called  completely  monotonic  if  all  its  derivatives  exist,  (  =  1,2,...,  and 
(-1 )'  ^  0,  H  >  0.  In  particular,  the  survival  function  (10)  of  the  PH  mixture  model  is 
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an  infinitely  differentiable  function  of  F.  The  alternative  derivation  of  procedure  (13)  bypasses 
all  the  above-mentioned  difficulty  and  formally  works  for  any  model  in  a  straightforward  and 
simple  fashion.  This  raises  a  series  of  questions.  Does  the  procedure  of  Section  1  work  for  any 
model?  What  is  its  relationship  to  the  EM  algorithm?  Does  it  inherit  the  monotonicity,  stability 
and  convergence  of  the  EM  algorithm? 

A  clue  to  generalizing  the  EM  algorithm  described  above  is  the  observation  that  the  der¬ 
ivation  of  the  E-step  (12)  does  not  require  knowledge  of  the  distribution  of  U.  Indeed,  de¬ 
note  by  j(x)  the  moment-generating  function  of  U  (other  arguments  are  omitted),  so  that 
j(x)  =  E(xu)  =  £{—  log(x)}.  Observe  that  the  first  equation  in  expression  (12)  can  be  written 
as 


-  _  E(FuUc+')  _  _7(c+1)(E) 

E(FuUc)  °  7 {C\F)  ’ 


(15) 


where  7(rl  denotes  the  derivative  of  order  c;  j(0)  :=j,  c  =  0,1.  Expression  (15)  represents  a 
variation  on  the  topic  of  the  derivation  of  moments  from  the  transform  of  a  distribution.  The 
consequence  of  equation  (15)  is  a  straightforward  and  general  specification  of  the  £-step  for 
any  mixture  model  formulated  in  terms  of  the  moment-generating  function.  In  fact,  it  is  even 
more  general  as  will  be  shown  in  what  follows.  To  elaborate  further  on  the  issues  raised  above, 
we  need  to  make  the  few  theoretical  observations  considered  in  the  next  section. 


4.  General  concepts 

4. 1.  Construction  using  the  difference  of  two  concave  functions 

For  studying  procedure  (3),  the  following  MM  construction  (Lange  et  al. ,  2000)  is  useful.  Let 
/(x)  =  B{\)  -  A(\ ),  where  A  and  B  are  differentiable  concave  functions.  The  iterative  maximi¬ 
zation  procedure, 

VB(x(*+ '))  =  VA(x(*}),  (16) 

where  VA(x)  =  dA/dx  is  the  gradient  of  A,  represents  an  MM  algorithm,  as  follows  from  con¬ 
vexity  arguments.  The  surrogate  objective  function  for  the  above  construction  has  the  form 

Q(x|xw)  =  B(x(k))  -  A(x)  +  VtA(xw)(x  -  x®).  (17) 

4.2.  EM  construction 

Let  £  be  the  observed  event,  and  U  be  a  random  variable  (vector)  representing  missing  data. 
The  EM  algorithm  is  a  method  to  maximize  the  log-likelihood  function  /(x)  =  log{L(x)}  of 
the  form 


L(x)  =  E{Lo(x\U)},  (18) 

where  Lo(x\U)  is  the  conditional  likelihood,  given  missing  data  (the  likelihood  constructed  to 
estimate  x  as  if  U  were  a  covariate),  and  x  does  not  include  parameters  of  the  distribution  of 
U.  To  facilitate  ‘distribution-free’  generalizations,  we  intentionally  avoid  explicit  expressions 
involving  the  distribution  of  U,  and  we  use  the  conditional  rather  than  the  joint  likelihood  of  U 
and  £  to  represent  the  EM  procedure.  In  this  construction,  unknown  parameters  entering  the 
distribution  of  U  do  not  participate  in  the  procedure,  and  the  maximization  is  considered  with 
respect  to  parameters  x  only.  For  any  function  f(U),  the  conditional  expectation  of  f{U),  given 


observed  event  £  and  x,  is  represented  as 
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E{f(U) \£,x} 


E{f(U)  Lp(x\U)} 
E{L0(x\U)} 


This  suggests  the  following  explicit  functional  notation  for  the  conditional  expectation  operator 


E(f\g)  := 


E(fg) 
E(g )  ’ 


(19) 


for  any  functions  /  and  g  of  U,  where  U  is  a  random  variable,  and  E(g)  is  the  probability  of  the 
condition.  A  standard  Jensen  inequality  argument  shows  that,  with  this  notation, 


G(x|y)  =  /( y)  +  E  {l0(x\U)  -  lo(y\U)\Lo(y\U)} ,  /0  -  log(L0),  (20) 

is  a  surrogate  objective  function  for  the  target  function  /(x).  The  operation  of  finding  U  such 
that  Iq(x\U)  =  E{k(x\U)\L0(y\U)}  is  referred  to  as  missing  data  imputation.  If  imputation  is 
easy  (E-step),  maximization  of  Q  with  respect  to  x  reduces  to  that  of  /o(x|f7),  a  complete-data 
problem. 

To  prove  that  any  converging  sequence  x(^  x*,  designed  according  to  equation  (4),  gives 

us  a  stationary  point  in  the  limit,  we  follow  the  argument  at  x  =  y  —  x* 

96(x|y)  _  E{dL0(x\U)/dx}  _  _ 

ax  L(  y)  L(y)  dx  {  } 

which  implies  that  the  score  equation  dL{x)/dx  =  0  is  satisfied  in  the  limit. 

The  EM  algorithm  proceeds  by  iterations  £{/g(x(*+1)|lOLo(x^|l/)}  =  0,  where  at  the  kth 
iteration  this  equation  is  solved  for  x(MK 


4.3.  Quasi-EM  construction 

Let  us  revisit  the  EM  construction  under  the  question  what  properties  of  the  E-operator  did  we 
actually  use  in  the  derivation  of  Section  4.2?  They  are  conditional  expectation  performed  ac¬ 
cording  to  expression  (19),  linearity  and  the  Jensen  inequality  in  equation  (20)  and  interchange- 
ability  of  E  and  d/dx  in  equation  (21).  Operators  satisfying  these  properties  will  be  called  QE. 
As  soon  as  a  QE  operator  satisfying  these  requirements  and  such  that  L(x)  =  QE{Lo(x|(7)}  is 
constructed,  the  likelihood  function  L  can  be  maximized  by  an  MM  algorithm  with  the  surro¬ 
gate  objective  function  as  in  equation  (20)  with  E  replaced  by  QE.  The  rationale  behind  this 
substitution  is  that  the  evaluation  of  E  requires  that  U  be  a  random  variable,  and  that  we  know 
its  distribution,  whereas  that  of  QE  does  not. 

Formally,  let  B  be  some  set  of  basis  functions  (including  the  function  /(«)  ==  1),  and  S  be  a  set 
of  all  admissible  functions  stretched  on  B  using  linear  combinations.  In  other  words  S  consists 
of  functions  /  such  that  /  =  E  \a\fi  for  any  sequence  (finite  or  infinite)  of  functions  {/;},  f\  e  B , 
and  real  numbers  {a/}. 

Define  QE  as  a  linear  functional  mapping  S  to  real  numbers  such  that 

(a)  QE(1)  :=  1,  where  1  means  a  function  that  is  equivalent  to  1  (/ normalization ), 

(b)  for  any  /  =  £/  aji  eS,  fie  B,  QE (f)  :=  £,•  a,  QE (/})  0 linearity ), 

(c)  for  any  function  f(u,a )  e  5,  and  such  that  df(u,a)/da  e  S , 

9QE{f{U9a)}  _  fdfjU9d)\ 
da  ^  \  da  J 


(, interchangeability ), 
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(d)  as  in  expression  (19),  for  any  functions  g and  fgeS and  such  that  QE (g) ^ 0, 


QE(/|fl)  := 


QE  (/<?) 

QE(5) 


(i conditional  QE)  and 

(e)  given  the  functions  g ,  fg  and  g  1  og(f)  e  S , 

QE{log(/)|5)  <  log{QE(/|5)} 


(22) 


(23) 


(Jensen  inequality). 

Let  us  consider  the  QE  requirements  more  closely.  We  start  by  postulating  one  basis  function 
fo(U,a)  and  the  value  of  QE  on  that  basis  function  QE(/o)  :=7 (a),  where  7  is  some  function 
of  a.  Dependent  on  how  many  times  we  are  allowed  to  differentiate  under  the  QE  symbol,  the 
derivatives 


f(i\U,a)  =  d(i)  fo(U,  a) /da' 


must  also  be  included  in  the  set  of  admissible  functions  S,  so  that  derivatives  of  QE(/o)  can  be 
defined.  Moreover,  QE  on  fu\  i  =  1, 2, ....  are  automatically  defined  by  the  interchangeability 
property  through  derivatives  of  7,  as  QE{/(,)}  =  7(,).  As  we  can  see,  QE  construction  is  cloned 
from  /o  and  7.  Mathematical  expectation  E{  fo(U,a)}  is  an  integral  transform  of  U,  where  a  is 
an  argument  of  the  transform.  Dependent  on  the  choice  of  the  function  /o,  QE  mimics  differ¬ 
ential  properties  of  the  corresponding  transform,  whereas  the  function  7  is  not  necessarily  an 
integral  transform. 

To  study  procedure  (3),  a  QE  construction  based  on  moment-generating  functions  is  useful. 
This  construction  is  cloned  from  the  basis  function  fo(U,a)  =  au ,  0<a^l,{/^0.  For  a 
mathematical  expectation,  E(fo)  =  7  is  the  moment-generating  function  of  U.  If  we  want  to 
be  able  to  differentiate  twice  under  the  QE  symbol,  we  need  two  more  basis  functions  Uau  and 
U2au ,  so  that 

QE(au)  =  7(a),  X 

QE  (Uau)=ai(a),  >  (24) 

QE (U2au)  =  a  j'(a)  +  a2  7" (a),  J 

by  the  interchangeability  property  with  two  derivatives  of  7  allowed.  We  now  derive  the  Jensen 
inequality  for  this  construction.  First,  noting  equation  (15),  introduce  the  conditional  QE 

0(*|c)  :=  mu\Ucxv)  =  c  +  x  ’^c}(^\  (25) 

where  we  used  expressions  (22)  and  (24)  to  obtain  the  right-hand  part  of  expression  (25),  c  = 
0, 1 .  The  function  0  is  a  surrogate  of  conditional  expectation  of  U,  given  observed  data,  and 
it  serves  as  an  imputation  operator  in  the  QEM  construction.  Next,  let  Bk  be  a  family  of 
functions  Bk  —  { Ukau  ,0  ^  a  <  1,  U  >  0}.  Then  by  using  the  fact  that,  for  f  —  f(U,a)  =  au  and 
g  =  g(U,b)  =  Ucbu, 

j-[QE{log(/)|5}  -  log{QE(/|g)}]  =  -{0(fc|c)  -  0(ab|c)}, 
da  a 


the  following  can  be  proved. 
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Theorem  1  (Jensen  inequality  for  QE).  Let  0(x|c)  as  defined  by  expression  (25)  be  a  non¬ 
decreasing  function  of  x,  c  =  0,1.  Then  inequality  (23)  holds  true  for  any  f<=B0  and 
g  e  Bo  U  B\ . 

It  is  interesting  that  mathematical  expectation  satisfies  the  assumption  of  non-decreasing  0, 
which  we  call  the  convexity  assumption  for  reasons  that  will  become  clear  later  as  we  discuss  an 
application  of  the  above  theory  to  semiparametric  likelihood. 

Theorem  2  (convexity  for  mathematical  expectation).  Let  7  be  a  function  defined  by  using  the 
mathematical  expectation  operator  E  as  7(x)  :=  E(xu),  where  U  is  a  non-negative  random 
variable.  Then  0(jc|c)  as  defined  in  expression  (25)  is  non-decreasing  in  x  for  any  c  =  0, 1. 

Proof.  The  Cauchy-Schwartz  inequality  with  functions  £(U,x)  =  Ul+C^2xu /2  and  ((U,x)  = 
Uc!2xP f 2  can  be  used  to  show  that  0'(x|c)  ^  0. 

5.  Non-linear  transformation  models 

5. 1.  Definition 

To  study  procedure  ( 1 )— (3)  in  more  detail  using  the  developments  of  Section  4,  we  need  to 
specify  a  certain  structure  of  the  likelihood  function  to  be  optimized.  To  do  this,  let  us  confine 
ourselves  to  a  large,  yet  specific,  class  of  semiparametric  survival  models.  Consider  a  parametric 
regression  model  with  support  on  [0, 1].  Let  y(x\0,z),x  e  [0, 1],  be  a  parametrically  specified 
distribution  function  in  x ,  conditional  on  covariates  z.  We  require  that  7  be  twice  differentiable 
with  respect  to  ;c  and  regression  coefficients  0 . 

We  can  now  define  a  semiparametrically  specified  survival  function  G(t\0,  z),  given  covariates, 
as 


O(t\0,z)  =  7{F(0I/3,z},  (26) 

where  the  base-line  survival  function  F  is  specified  nonparametrically.  The  class  of  models 
(26)  will  be  called  NTMs,  to  give  it  a  name.  Functions  like  7  will  be  called  NTM-generating 
functions.  An  NTM  is  obtained  by  plugging  a  nonparametrically  specified  survival  function  F 
into  a  parametric  distribution  function  7  with  the  support  compatible  with  the  range  of  F . 
One  important  subclass  of  NTMs  is  the  family  of  PH  mixture  models  (10),  for  which  7  is  a 
moment-generating  function  of  U, 


7(jc|/3,z)  =  E(xu^z^  |z). 


To  represent  a  semiparametric  model  in  the  NTM  form,  we  need  to  express  its  survival  func¬ 
tion  G  as  a  function  7  of  a  base-line  survival  function  F  (this  representation  is  not  unique  and 
is  not  always  possible).  For  example,  from  equation  (9)  we  obtain  the  PO  model  in  the  form 
G(t\-)  =  6>(-)/[0( •)  -  log{F(0}],  which  gives 


TWO 


fl(  0 

0(0  -  log(*) ' 


(27) 


The  class  of  NTMs  includes  the  class  of  linear  transformation  models  (Cheng  et  al ,  1995, 
1997).  It  is  easy  to  show  that  a  linear  transformation  model  can  be  represented  as  j(x\0,z)  = 
p[\og{0(/3,z)}  +  q(x)\9  where  p  is  a  parametrically  specified  tail  function,  q  is  an  inverse  tail 
function  and  6  is  a  predictor  (it  is  convenient  to  specify  q  as  the  inverse  of  p;  then  6  =  1 
corresponds  to  the  base-line  7(x|0  =  x). 
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5.2.  Algorithm 

In  the  survival  analysis  formulation,  under  non-informative  right  censoring,  the  contribution 
of  observations  sampled  from  an  NTM  (26)  to  the  likelihood  are  log(— dG)  and  log(G),  for  a 
failure  and  censored  observation  respectively.  We  have 

— dG(f|/3,z)  =  y{F(t)\(3,z}F(t)dH(t), 

where  -y/ (jc|->  =  d~f(x\-)/dx,  differentials  are  taken  with  respect  to  t  under  a  continuous  model 
and  we  recall  that  F  =  exp(-tf).  We  may  now  rewrite  the  likelihood  (8)  for  an  NTM  as 

/  =  E  Dt  log(A//;)  +  £  E  log{t9(/v|/3,  Zjj,  Cjj)},  (28) 

/=!  i=l  JeCjUVj 


where 

tf(*l  •  ,c)  =  xci{c\x\-), 

and  AH,  is  substituted  for  d H(tj).  It  is  easy  to  check  that  a  negative  derivative  of  log{??(F,  |  •  ,c)} 
with  respect  to  A H„,  is  represented  by  6(F,  |  • ,  c),  if  m  <  i,  and  is  equal  to  0  otherwise,  where 
the  function  0  is  defined  by  expression  (25).  Therefore,  the  construction  (l)-(3)  of  Section  1 
leads  us  to  the  iteration  scheme 


A  H«+])  =  Dm 


/  £  e(F^\(3,zu,cu). 

'  ijefcn, 


(29) 


This  procedure  is  a  generalization  of  procedure  (13)  to  the  NTM  family.  For  the  PO  model, 
substituting  equation  (27)  into  expression  (25),  we  obtain 


0(x|  •  ,c)  = 


c  +  1 

0(.)-logW 


(30) 


It  is  clear  that,  with  0  given  by  equation  (30),  the  general  procedure  (29)  turns  into  the  procedure 
(13)  derived  for  the  PO  model  in  Section  3. 


5.3.  Quasi-expectation  form  of  a  non-linear  transformation  model 

We  now  make  use  of  the  QE  theory  of  Section  4.3  to  provide  a  link  between  NTMs  and  the  QE 
operator.  Equations  (24)  summarizing  second-order  differential  properties  of  the  QE  operator 
will  be  the  main  instrument  of  this  section. 

First,  let  us  synchronize  the  development  of  Section  4.3  and  the  definition  of  NTM  (26)  in 
Section  5.1  by  assuming  that  the  function  7  that  is  used  in  both  sections  is  the  same  function. 
In  fact,  we  already  used  this  synchronization  when  we  noticed  in  the  previous  section  that  0  in 
equation  (29)  and  in  expression  (25)  is  the  same  function.  Now,  from  the  first  line  of  expression 
(24),  with  F(t)  instead  of  a,  we  obtain  the  QE  form  of  the  NT  model 

G(f|/3,z)  =  QE/3)2{F(0t/},  (31) 

where  the  subscript  /3,  z  to  the  QE  operator  indicates  that  QE  is  defined  by  using  the  function 
y(x\/3,z).  Equation  (31)  is  a  postulate  in  the  definition  of  the  QE  operator,  and  its  link  to  the 
NTMs  is  established  as  we  assume  that  QE  is  defined  by  using  an  NTM-generating  function  7. 

Now,  consider  the  likelihood  function  /  (28).  Given  an  observation  (t,z,c),c  =  0, 1,  its  con¬ 
tribution  v{F(t)\/3,z,c}  to  the  likelihood  L  —  exp(/)  can  be  written  as 

v(F |  •  ,c)  =  d(F\  ■  ,c)AHc  =  AHCFC  7(c)(F|  •  ,c)  =  QE(A HcUcFu), 
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where  the  last  equation  follows  from  the  first  two  lines  of  expression  (24)  and  linearity  of  QE. 
As  a  result,  the  likelihood  of  an  NTM  mimics  that  of  a  mixture  model 

L  =  UQE(AHcUcFu). 


Consider  the  hazard  function  A(r|z),  corresponding  to  the  survival  function  G(t |z).  Differ¬ 
entiating  the  survival  function  (26),  and  using  expression  (24),  we  have 


A(f|)  =  - 


1 

G(/|-) 


BGm 

dt 


y{F(oi-}F(o 

tTOI-} 


h(t)  = 


QE  {UF(t)u} 
QE{  F(t)u} 


h(t). 


where  h  is  the  hazard  function  corresponding  to  F.  Applying  the  definition  of  conditional  QE 
(22)  to  this  expression,  and  using  expression  (25),  we  obtain 


A(f|z)  =  QE {U\F(t)u}h(t)  =  0(F|  •  ,0)  h(t). 


This  is  a  generalization  of  the  fact  that  the  population  hazard  function  at  time  t  in  a  heteroge¬ 
neous  population  is  represented  as  a  conditional  average,  given  survival  up  to  time  t. 

Bringing  these  derivations  together  with  expression  (25),  we  have  the  following  theorem. 

Theorem  3  (QEM  construction).  Consider  a  survival  analysis  problem  for  an  NTM  gener¬ 
ated  by  the  function  y(x\/3,z),  with  fixed  covariates.  With  the  QE  operator  as  defined  in 
Section  4.3,  and  using  the  same  NTM-generating  function  7  in  its  definition,  the  following 
representations  are  valid:  survival  function, 

G{t\f3,  z)  =  QE  ^{F(t)U}l 


hazard  function , 

A(/|z)  =  QEff/IFfr)"}  h(t)  =  0(F|  •  ,0)  h{t). 


where  A  and  h  are  hazards  functions  corresponding  to  G  and  F  respectively;  likelihood  func¬ 
tion , 

l  =  t{  E  log[QE,>Z;j{(t/A//,)c'^w}]); 

1=1  V  jeCiWi  / 


imputation  operator, 

0  —  QE(U\UcFu)  =  ©(F|  •  ,c),  c  =  0, 1, 

where  U  denotes  U,  imputed  by  using  the  conditional  QE  operator. 


6.  Summary  and  justification  of  the  procedure 

Let  us  now  go  back  to  the  EM  algorithm  of  Section  3  and  see  how  the  results  obtained  since 
then  allow  us  to  streamline  and  justify  our  algorithm  construction,  using  the  PO  model  as  an 
example.  In  summary,  we  now  have  the  following  procedure. 

(a)  Obtain  the  NTM-generating  function,  representing  the  model  survival  function  as  a  func¬ 
tion  of  F.  For  the  PO  model  (9)  G{t\-)  =  6>(  )[6>(-)  -  log{ F(0}]-1 ,  we  have  equation  (27), 

7(x|-)  =  0(-){0(-)-log«}-'. 

(b)  Obtain  the  imputation  operator  (25) 

©(jc|  •)  =  c  +  jc7(c+1)(x|-)  7<c')(x|-)“'1. 
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For  the  PO  model,  this  results  in  equation  (30), 

0W-)  =  (c+l){0()-logW}-1. 

Check  that  0(jc|)  is  a  non-decreasing  function  of  jt  (see  the  justification  below). 

(c)  Obtain  the  profile  likelihood  by  iterations  (29), 

A Hj*+l)  =  Dm\  £  Q(Flk)\p,Zij,Cij) 

(d)  Maximize  the  profile  likelihood  with  respect  to  (3  as  in  Section  2. 

For  the  PH  mixture  model,  QE  is  equivalent  to  E  (compare  equations  (31)  and  (10)),  which 
makes  0  the  conditional  expectation  of  U,  given  observed  data  (compare  expressions  (25)  and 
(15)).  In  this  case,  the  above  procedure  is  an  EM  algorithm. 

Justification  of  this  procedure  works  through  the  proof  of  monotonicity  (i.e.  the  likelihood  is 
improved  at  each  step)  under  the  following  assumption. 

6. 1 .  Convexity  assumption 

Consider  an  NTM  with  the  NTM-generating  function  7.  Assume  that 

0(jc|/3,  z,  c)  is  a  non-decreasing  function  of  x,  for  any  (3,  z,  c.  (32) 

We  have  two  ways  to  show  monotonicity. 

(a)  Observe  that,  under  assumption  (32),  the  likelihood  (28),  as  a  function  of  the  vector 
AH,  represents  a  difference  between  two  concave  functions  E,  D,  log(A//,)  and 
— E^logl^F,)-)}.  This  follows  from  the  fact  that  0  is  the  negative  derivative  of  log(tf) 
with  respect  to  H.  Therefore,  monotonicity  follows  from  the  results  of  Section  4.1. 

(b)  Observe  that  the  likelihood  is  represented  as  a  QE  L  —  UQE(AHCUC Fu)  (Section  5.3), 
and  that  under  assumption  (32)  the  QE  operator  satisfies  the  Jensen  inequality  (Section 
4.3).  Therefore,  the  EM  proof  of  monotonicity  works. 

Convergence  of  the  algorithm  under  monotonicity  follows  from  the  results  of  Lange  et  al. 
(2000)  and  Wu  (1983)  under  fairly  general  conditions. 

7.  Real  data  example 

As  an  example  we  use  data  from  the  National  Cancer  Institute’s  ‘Surveillance  epidemiology 
and  end  results’  programme.  Using  the  publicly  available  database  for  the  programme,  11621 
cases  of  primary  prostate  cancer  diagnosed  in  the  state  of  Utah  between  1988  and  1999  were 
identified.  The  following  selection  criteria  were  applied  to  a  total  of  19  819  Utah  cases  registered 
in  the  database:  valid  positive  survival  time,  valid  stage  of  the  disease  and  age  1 8  years  or  more. 
Prostate  cancer  specific  survival  was  analysed  by  the  stage  of  the  disease  (localized  or  regional 
versus  distant).  For  the  definition  of  stages  as  well  as  for  other  details  of  the  data  we  refer  the 
reader  to  the  documentation  for  the  programme  at  http :  / / seer .  cancer .  gov/. 

The  PH  and  the  PO  models  with  z  representing  two  groups  corresponding  to  the  localized  or 
regional  stage  (10765  cases)  and  distant  stage  (856  cases)  respectively  were  fitted  by  using  the 
profile  MM  algorithm.  The  log-odds-ratio  was  estimated  as  /3  =  —3.251  with  95%  asymptotic 
confidence  interval  (—3.41 6,  —3.086).  A  likelihood  ratio  test  showed  that  the  difference  between 
groups  is  highly  significant  (p  <  0.0001).  Observed  (Kaplan-Meier)  and  expected  model-based 
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Time  (months) 

(a) 

Fig.  1-  Observed  ( - )  versus  expected  (* 

fitted  to  the  prostate  cancer  data 


Time  (months) 

(b) 

■)  plots  corresponding  to  (a)  the  PH  and  (b)  the  PO  mode! 


estimates  of  the  survival  functions  by  group  are  shown  in  Fig.  1.  The  PO  model  showed  a 
superior  fit  to  the  data. 

On  the  basis  of  the  PO  model,  four  different  approaches  to  model  fitting  are  compared. 

(a)  MM  or  QEM :  the  method  is  described  in  Section  6.  Maximization  of  the  profile  likelihood 
is  performed  by  the  Powell  method  (Press  et  al ,  1994). 

(b)  EM:  the  EM  method  is  similar  to  the  EM  algorithm  as  used  to  fit  frailty  models  with 
predictor  6((3,i).  Using  the  QEM  formulation,  the  procedure  is  as  follows. 

(i)  With  the  current  iteration  /3(k)  and  F(k)  compute  =  6~x  (f3(k\  z,7)  Q(F-^ \/3^k\xij) 

for  each  subject  i  j. 

(ii)  Maximize  the  partial  likelihood  for  a  PH  model  with  (imputed)  predictor  0(j3,  z ,* j) 

with  respect  to  /3.  Set  /3(M)  at  the  solution. 

(iii)  Update  the  function  F  by  using  the  Nelson-Aalen  estimator  for  the  PH  model  fitted 
at  the  previous  step.  Denote  the  solution  by  F(k+]\ 

(iv)  Set  k  =  k  -f  1 .  Continue  iterations  until  convergence. 

(c)  Parametric :  in  the  parametric  method,  the  function  F  is  specified  as  a  Weibull  survival 
function.  The  parametric  regression  model  is  fitted  by  using  the  Powell  method  applied 
to  all  model  parameters. 

(d)  Full  model  (Powell):  apply  the  Powell  method  to  maximize  the  log-likelihood  of  the  full 
semiparametric  model  with  respect  to  the  joint  vector  of  regression  coefficients  /3  and  the 
base-line  hazard  AH. 

Computation  of  9,0  or  7  is  counted  as  one  operation.  For  a  given  tolerance  s,  the  stopping 
rule  is  defined  as  /^'+1)  -  l(k)  <  s.  If  the  method  required  solving  a  nested  numerical  problem 
(MM  or  EM),  the  tolerance  for  the  nested  problem  is  specified  as  s/100. 

First,  we  evaluate  the  precision  by  operations  characteristics  of  the  above  numerical  meth¬ 
ods.  The  precision  is  measured  by  /*  -  l(k\  where  the  exact  solution  /*  was  approximated  by 
the  solution  obtained  for  e  —  10”20.  Shown  in  Fig.  2  are  the  precision  by  operations  curves  for 
the  four  methods,  obtained  by  varying  e.  It  is  clear  from  Fig.  2  that  the  profile  MM  algorithm 
outperforms  the  other  approaches  in  the  number  of  operations  that  are  required  to  reach  a 
given  precision.  The  profile  MM  method  is  closely  followed  by  the  frailty  EM  algorithm.  Fit¬ 
ting  the  full  semiparametric  model  by  the  Powell  method  shows  the  worst  performance.  The 
advantage  of  the  EM-like  approaches  compared  with  methods  that  invoke  the  function  F  into  a 
conventional  maximization  is  explained  by  the  utilization  of  a  closed  form  solution  for  F  in  the 
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Fig.  2.  Precision  of  likelihood  maximization  by  the  number  of  operations  (precision  is  measured  as  the  dif¬ 
ference  between  the  limiting  value  of  the  likelihood  as  operations  tend  to  oo  and  the  maximal  likelihood  value 
achieved  under  a  stopping  rule;  curves  closer  to  the  y-axis  correspond  to  more  efficient  numerical  methods): 
- ,  MM; . ,  EM; - ,  parametric;  . ,  full  model 


Sample  size  Sample  size 

(a)  (b) 


Fig.  3.  Numerical  efficiency  by  sample  size  for  (a)  the  full  model  ( . ,  polynomial  fit)  and  (b)  the  EM 

(□)  and  MM  (o)  methods  ( . ,  linear  fits): - ,  number  of  operations  needed  to  achieve  a  fixed  pre¬ 

cision  by  sample  size  (each  point  corresponds  to  a  sample  generated  from  a  parametric  PO  model  with  a 
Weibull  base-line  survival  function  with  parameters  specified  by  using  the  model  fit  to  data  described  in 
Section  7) 

form  of  the  Nelson-Aalen-Breslow  estimator.  For  the  same  reason,  the  MM  and  the  EM  proce¬ 
dures  show  a  linear  trend  with  increasing  dimension,  given  fixed  precision  as  shown  in  Fig.  3.  To 
obtain  Fig.  3,  samples  of  size  100-1000  were  generated  from  the  parametric  (Weibull)  PO  model 
fitted  to  the  same  data.  The  MM,  EM  and  full  model  (Powell)  procedures  were  applied  to  each 
such  sample.  The  tolerance  e  —  10" 3  was  used  for  the  MM  algorithm.  The  tolerance  for  the 
other  two  methods  was  tuned  to  give  a  likelihood  that  was  as  close  as  possible  yet  smaller  than 
the  likelihood  achieved  by  the  MM  method  (to  keep  the  comparison  conservative).  The  profile 
MM  algorithm  shows  the  most  favourable  behaviour  with  increasing  dimension,  followed  by  the 
EM  procedure.  It  comes  as  no  surprise  that  the  full  model  (Powell)  method  shows  the  greatest 
complexity. 
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8.  Conclusion 

We  presented  an  application  of  the  general  MM  principle  to  a  class  of  semiparametric  models. 
Three  methods  of  specifying  the  surrogate  objective  function  were  demonstrated.  In  particular, 
we  clarified  the  connection  between  the  likelihood-based  MM  principle  and  the  imputation- 
based  self-consistency  principle  that  is  used  in  EM  algorithms  for  semiparametric  models.  To 
study  this  connection,  we  built  an  EM-like  world  behind  the  MM  algorithm  by  using  the  QEM 
construction.  The  approaches  were  illustrated  by  using  continuous  NTMs  in  a  survival  analysis 
context.  This  is  just  one  example  of  how  these  constructions  can  be  used.  Discrete  survival 
models,  cure  models,  multivariate  semiparametric  models,  models  with  time-dependent  covari¬ 
ates  and  many  other  statistical  models  can  be  treated  by  application  of  the  principles  presented 
in  this  paper.  Construction  of  surrogate  objective  functions  is  not  straightforward.  Having  an 
option  to  work  a  particular  problem  from  both  ends  (likelihood  or  convexity  versus  imputation 
or  self-consistency)  may  increase  the  chance  of  finding  efficient  and  general  procedures  that  are 
applicable  to  large  classes  of  models. 
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Estimating  Cure  Rates  From  Survival  Data: 
An  Alternative  to  Two-Component  Mixture  Models 

A.  D.  Tsodikov,  J.  G.  Ibrahim,  and  A.  Y.  Yakovlev 


This  article  considers  the  utility  of  the  bounded  cumulative  hazard  model  in  cure  rate  estimation,  which  is  an  appealing  alternative  to  the 
widely  used  two-component  mixture  model.  This  approach  has  the  following  distinct  advantages:  (1)  It  allows  for  a  natural  way  to  extend 
the  proportional  hazards  regression  model,  leading  to  a  wide  class  of  extended  hazard  regression  models.  (2)  In  some  settings  the  model 
can  be  interpreted  in  terms  of  biologically  meaningful  parameters.  (3)  The  model  structure  is  particularly  suitable  for  semiparametric  and 
Bayesian  methods  of  statistical  inference.  Notwithstanding  the  fact  that  the  model  has  been  around  for  less  than  a  decade,  a  large  body 
of  theoretical  results  and  applications  has  been  reported  to  date.  This  review  article  is  intended  to  give  a  big  picture  of  these  modeling 
techniques  and  associated  statistical  problems.  These  issues  are  discussed  in  the  context  of  survival  data  in  cancer. 

KEY  WORDS:  Bayesian  methods;  Biologically  based  models;  Bounded  cumulative  hazard;  Cure  models;  Hazard  regression;  Semipara¬ 
metric  inference;  Survival  data. 


1.  INTRODUCTION 

In  many  clinical  and  epidemiological  settings,  investigators 
observe  cause-specific  survival  curves  that  tend  to  level  off 
at  a  value  strictly  greater  than  0  as  time  increases.  A  promi¬ 
nent  example  of  this  pattern  is  shown  in  Figure  1(a).  A  well- 
pronounced  plateau  in  this  display  of  the  Kaplan-Meier  curve 
may  be  thought  of  as  an  indication  of  the  presence  of  a  pro¬ 
portion  of  patients  for  whom  the  disease  under  study  will  never 
recur.  Alternatively,  one  can  consider  such  patients  to  be  cured. 
Clearly,  estimating  the  proportion  of  cured  patients  may  have 
important  medical  implications.  In  addition,  clinical  covariates 
may  exert  dissimilar  effects  on  the  probability  of  cure  and  the 
timing  of  tumor  relapse  or  other  events  of  interest.  This  is  ap¬ 
parent  from  Figure  1  (b),  where  two  survival  curves  for  patients 
with  localized  breast  cancer  stratified  by  age  are  presented. 
These  plots  suggest  that  the  two  categories  of  patients  are  likely 
to  have  a  similar  probability  of  cure,  whereas  a  short-term  effect 
of  age  at  diagnosis  on  cancer-specific  survival  is  highly  plausi¬ 
ble.  It  is  also  clear  that  the  commonly  used  proportional  hazards 
model  fails  to  fit  the  data  shown  in  Figure  1(b).  In  Section  3.3 
we  present  a  detailed  example  involving  prostate  cancer  sur¬ 
vival  that  has  major  biomedical  implications.  Such  an  advance 
would  have  not  been  possible  to  make  without  invoking  the 
concept  of  cure.  The  preceding  examples  suggest  at  least  two 
advantages  that  survival  models  have  for  allowing  for  the  pres¬ 
ence  of  cured  individuals:  (1)  They  enrich  our  ability  to  inter¬ 
pret  survival  analysis  in  terms  of  characteristics  that  have  a  clear 
biomedical  meaning;  (2)  they  lead  to  more  general  regression 
models,  thereby  extending  our  ability  to  describe  actual  data. 
The  latter  contention  holds  whether  the  probability  of  cure  is 
significantly  separated  from  0  (see  Sec.  3). 
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To  develop  relevant  methods  of  statistical  data  analysis,  one 
has  to  provide  a  rigorous  definition  of  cure  rate.  In  this  arti¬ 
cle,  we  will  proceed  from  the  most  widely  accepted  concept  of 
biological  cure.  The  probability  of  (biological)  cure,  variously 
referred  to  as  the  cure  rate  or  the  surviving  fraction,  is  defined 
as  an  asymptotic  value  of  the  survival  function  G  it)  as  t  tends 
to  oo.  Let  X  be  the  survival  time  with  cumulative  distribution 
function  (c.d.f.)  G(t)  =  1  —  G(t).  Under  a  continuous  model 
the  existence  of  a  nonzero  surviving  fraction,  p,  is  determined 
by  the  behavior  of  the  hazard  function,  A(r),  by  virtue  of  the 
equality 


p  =  lim  G(f)  =  exp{“  /  k(u)du 
t-+oo  [  J0 


(1.1) 


Whenever  p  >  0  the  underlying  survival  time  distribution  is 
said  to  be  improper.  Clearly,  k(u)  -*0asw->ooifp>0  and 
the  limit  of  A .(«)  (as  u  oo)  exists.  Formulated  in  terms  of  the 
marginal  failure  time  distribution,  this  definition  does  not  imply 
that  the  overall  survival  time  may  be  infinite,  because  the  time 
to  death  from  other  causes  (censoring  time)  is  finite  with  prob¬ 
ability  1.  Therefore,  the  distribution  of  the  observed  lifetime  in 
the  presence  of  competing  risks  is  always  proper,  and  there  is 
no  defiance  of  common  sense. 

Boag  (1949)  and  later  Berkson  and  Gage  (1952)  proposed  a 
two-component  (binary)  mixture  model  for  the  analysis  of  sur¬ 
vival  data  when  a  proportion  of  patients  are  cured.  Since  then, 
the  binary  mixture-based  approach  has  become  the  dominant 
one  in  the  literature  on  cure  models  (Miller  1981;  Farewell 
1982;  Goldman  1984;  Greenhouse  and  Wolfe  1984;  Gamel, 
McLean,  and  Rosenberg  1990;  Gordon  1990;  Bentzen,  Jo¬ 
hansen,  Overgaard,  and  Thames  1991;  Goldman  and  Hillman 
1992;  Kuk  and  Chen  1992;  Laska  and  Meisner  1992;  Mailer 
and  Zhou  1992,  1994,  1995,  1996;  Sposto,  Sather,  and  Baker 
1992;  Yamaguchi  1992;  Gamel  and  Vogel  1993;  Gamel,  Vo¬ 
gel,  Valagussa,  and  Bonadonna  1994;  Chappell  Nondahl,  and 
Fowler  1995;  Taylor  1995;  Gamel,  Meyer,  Feuer,  and  Miller 
1996;  Peng  and  Dear  2000;  Sy  and  Taylor  2000,  2001 ,  to  name 
a  few).  The  main  idea  behind  this  approach  is  that  any  improper 
survival  function  can  be  represented  as 

G(t)  =  E{[G0(O]"}  =/>  +  (!-  p)G0(f),  (1.2) 
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(a)  (b) 


Figure  1.  (a)  Relapse-Free  Survival  for  Patients  With  Hodgkin’s  Disease  Treated  by  Radiotherapy.  Data  from  the  International  Database  on 
Hodgkin’s  Disease,  (b)  Breast  cancer  specific  survival  in  local  stage  by  age.  Data  from  the  Surveillance  Epidemiology  and  End  Results  Program . 


where  M  is  a  binary  random  variable  taking  on  the  values  of  0 
and  1  with  probability  p  and  1  -  p,  respectively,  with 

p  =  Pr{X  =  oo}, 

and  Go(t)  is  defined  as  the  survival  function  for  the  time  to 
failure  conditional  upon  ultimate  failure,  that  is, 

Go(0  =  Pr{X  >  t\X  <  oo}.  (1.3) 

When  designing  regression  counterparts  of  model  (1.2),  it  is 
common  practice  to  use  the  logistic  regression  model  for  in¬ 
corporating  covariates  into  the  probability  p,  and  proportional 
hazards  regression  for  modeling  the  effect  of  the  covariates  on 
the  conditional  survival  function  Go. 

An  alternative,  but  equally  general,  representation  of  an  im¬ 
proper  survival  time  distribution  can  be  obtained  by  assuming 
that  the  cumulative  hazard  A (t)  =  /Qr  X(t)dt  has  a  finite  posi¬ 
tive  limit,  say  0,  as  t  tends  to  oo.  In  this  case  one  can  write 

G(t)  =  e~eF(,\  0>O,t>O,  (1.4) 

where  F(t)  =  A  (t)/6  is  the  c.d.f.  of  some  nonnegative  random 
variable  such  that  F(0)  =  0.  In  what  follows  we  will  call  the 
model  given  by  (1.4)  the  bounded  cumulative  hazard  (BCH) 
model. 

Within  the  nonparametric  framework  it  makes  no  difference 
whether  representation  (1.2)  or  (1.4)  is  used  as  a  basis  for  the 
estimation  of  p,  but  the  situation  is  not  the  same  when  Go(f)  is 
parametrically  specified.  Using  definition  (1.3),  one  can  repre¬ 
sent  the  survival  function  (1.4)  in  the  form  of  formula  (1 .2),  but 
both  p  and  Go  become  functions  of  the  common  parameter  0 : 


A  similar  confounding  occurs  when  one  represents  (1 .2)  in  the 
form  of  (1.4).  A  summary  of  useful  formulas  showing  the  rela¬ 
tionship  between  the  two  models  was  given  in  Chen,  Ibrahim, 
and  Sinha(1999). 

Although  the  models  given  by  (1.2)  and  (1.4)  are  just  two 
different  ways  of  rescaling  the  survival  function  G(/),  it  took 
a  long  time  to  realize  some  virtues  of  the  BCH  model  (which 


will  be  discussed  later),  which  is  the  main  subject  matter  of  the 
present  article.  The  first  move  in  this  direction  was  due  to  Hay- 
bittle  (1959, 1965).  The  author  proceeded  from  the  observation 
that  in  some  clinical  data  on  cancer  survival,  an  actuarial  es¬ 
timate  of  the  hazard  function  tends  to  decrease  exponentially 
with  time.  If  the  same  property  holds  for  the  true  hazard,  ex¬ 
pression  (1.4)  assumes  the  form: 

G(r)=exp{-0(1  -e~(')},  ( >  0.  (1.5) 

A  comprehensive  treatment  of  this  model  was  given  by  Cantor 
and  Shuster  (1992)  who  used  the  following  parameterization: 

G(/)  =  exp{|(l-ef')j,  f*0,  (1.6) 

where  >  0,  but  ?  may  take  either  sign.  If  f  >  0  the  survival 
function  (1.6)  corresponds  to  the  proper  Gompertz  distribution, 
but  if  f  <  0  the  distribution  is  improper  with  the  surviving 
function  equaling  ,  For  this  reason  the  distribution  given 
by  (1.6)  can  be  called  a  generalized  Gompertz  distribution.  In 
more  recent  work  Cantor  (200 1 )  used  representation  (1 .6)  to  de¬ 
termine  the  projected  variance  of  estimated  survival  probabili¬ 
ties  in  clinical  trials.  A  generalization  of  the  model  (1.6)  was 
proposed  by  Cantor  (1997)  who  suggested  approximating  a  log 
hazard  by  a  polynomial  to  obtain  an  estimate  of  the  cure  rate 
based  on  formula  (1.1). 

Clearly,  the  Gompertz-like  model  given  by  formula  (1.5)  is 
a  special  case  of  formula  (1.4)  with  the  function  F{t)  spec¬ 
ified  by  an  exponential  c.d.f.  with  parameter  f .  The  repre¬ 
sentation  (1.4)  was  first  introduced  in  an  article  of  Yakovlev 
et  al.  (1993)  and  discussed  later  as  an  alternative  to  the  mix¬ 
ture  model  by  Yakovlev  (1 994).  Interestingly  enough,  Yakovlev 
et  al.  (1993)  proceeded  from  purely  biological  considerations; 
the  idea  of  imposing  a  constraint  on  the  behavior  of  the  haz¬ 
ard  function  was  introduced  later  in  Tsodikov,  Loeffler,  and 
Yakovlev  (1998b).  In  fact,  the  authors  proposed  a  simple  mech¬ 
anistically  motivated  model  of  tumor  recurrence  yielding  an 
improper  survival  time  distribution.  Under  this  model  the  prob¬ 
ability  of  tumor  cure  is  defined  as  the  probability  of  no  clono- 
genic  tumor  cells  (clonogens)  surviving  by  the  end  of  treatment. 


Tsodikov,  Ibrahim,  and  Yakovlev:  Estimating  Cure  Rates 
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The  cell  is  called  clonogenic  if  it  is  capable  of  producing  a  cell 
clone,  that  is,  a  group  of  cells  that  have  this  cell  as  their  com¬ 
mon  parent.  There  is  biological  evidence  that  the  majority  of 
recurrent  tumors  are  clonal  in  origin;  that  is,  they  arise  from  a 
single  progenitor  cell.  In  cancer  studies  the  primary  endpoint  is 
conventionally  the  time  to  failure,  referred  to  as  survival  time 
or  failure  time,  X,  with  the  event  of  failure  being  either  tu¬ 
mor  recurrence  (disease-free  survival)  or  death  caused  by  the 
cancer  under  study  (cancer-specific  survival).  According  to  the 
clonal  model  of  posttreatment  tumor  development  proposed  by 
Yakovlev  et  al.  (1993),  a  recurrent  tumor  arises  from  a  single 
clonogenic  cell.  Every  surviving  clonogen  can  be  characterized 
by  a  latent  time  (termed  the  progression  time)  during  which  it 
could  potentially  propagate  into  an  overt  tumor.  Let  M  be  the 
number  of  clonogens  remaining  in  a  treated  tumor  and  (pM  be  its 
probability  generating  function  (p.g.f.).  Assuming  that  progres¬ 
sion  times  for  surviving  clonogens  are  independent  andjdenti- 
cally  distributed  (i.i.d.)  with  a  common  survival  function  F,  one 
can  find  easily  that  G(r)  =^a/(T(0),  t  >  0.  This  formula  sug¬ 
gests  that  knowledge  of  the  entire  distribution  of  the  number  of 
surviving  clonogens  is  critical  for  developing  biologically  mo¬ 
tivated  survival  models  with  cure.  In  particular,  suppose  that  M 
is  Poisson  distributed.  Then  the  survival  function  G(t )  is  given 
by  the  BCH  model  (1.4)  with  the  parameter  6  interpreted  as  the 
mean  number  of  surviving  clonogens. 

A  more  general  mechanistic  model  was  considered  by  Hanin 
(2001).  Suppose  a  tumor  initially  comprising  a  nonrandom 
number  i  of  clonogenic  cells  is  exposed  to  a  fractionated  radi¬ 
ation  schedule  consisting  of  n  instantaneously  delivered  equal 
doses  D  separated  by  equal  time  intervals  r.  It  is  assumed  that 
every  cell  survives  each  exposure  to  the  dose  D  with  the  same 
probability  s  =  s(D),  given  that  it  survived  the  previous  expo¬ 
sures,  and  independently  of  other  cells.  It  is  also  assumed  that 
the  death  of  irradiated  tumor  cells  is  effectively  instantaneous. 
Assuming,  in  addition,  that  tumor  growth  kinetics  between  ra¬ 
diation  exposures  can  be  modeled  as  a  homogeneousbirth-and- 
death  Markov  process  with  birth  rate  X  >  0  and  spontaneous 
death  rate  v  >  0,  Hanin  (2001)  derived  an  explicit  formula  for 
the  distribution  of  the  random  variable  (r.v.)  M;  the  latter  turned 
out  to  be  a  generalized  negative  binomial  distribution.  The  cor¬ 
responding  p.g.f.  is  of  the  form  (Hanin  2001 ): 

<Pm(u)  =  ■  l«l  <  i.  (1-7) 

suggesting  the  following  survival  function  for  the  time  to  tumor 
recurrence 


where  a  —  1  —  co  -  s  +  scopn~\  b  —  s(copn~]  —  1),  c  =  1  — 
co  —  s  +  spn~l ,  d  =  s(pn~l  —  1),  a  =  p,  =  s/a ,  co  = 

[X  -  va  -  $(X  -  v)]/[X(l  -  a)],  F(t)  =  1  -  F(t ),  satisfying 
the  conditions:  pc  ^  1  and  X  ^  v.  The  preceding  formula  may 
be  taken  as  the  starting  point  for  the  developmentof  a  new  class 
of  regression  survival  models;  with  this  aim  in  mind,  it  makes 
sense  to  reduce  the  number  of  baseline  parameters  by  enforcing 
the  conditions:  F(0)  =  0  and  G(0)  =  1 . 


Clinically  detectable  primary  tumors  are  estimated  to  contain 
at  least  i  =  105  clonogenic  tumor  cells  ranging  up  to  probably 
109  cells  or  even  more  (Tucker  1999).  On  the  other  hand,  the 
probability,  sn ,  for  a  cell  to  survive  n  fractions  is  expected  to 
be  very  low,  especially  in  the  total  dose  range  where  cures  oc¬ 
cur  frequently.  This  provides  the  rationale  for  exploring  limit¬ 
ing  distributions  associated  with  model  (1.8),  As  was  shown  by 
Hanin  (2001),  if  n  is  fixed,  i  oo  and  s  -»  0  in  such  a  way 
that  there  exists  a  limit  A  =  lim/_).oo  isnt  0  <  A  <  oo,  then  the 
distribution  of  the  number  of  clonogens  converges  to  a  Poisson 
distribution  with  parameter  0  =  A/an~l .  This  result  disproves 
the  conjecture  (Tucker,  Thames,  and  Taylor  1990;  Tucker  and 
Taylor  1996;  Tucker  1999)  that,  due  to  cell  proliferation  oc¬ 
curring  between  fractions  of  radiation,  the  limiting  distribution 
of  the  number  of  surviving  clonogenic  cells  may  not  be  Pois¬ 
son.  However,  the  convergence  to  the  limiting  distribution  is 
quite  slow,  indicating  a  strong  point  in  the  line  of  reasoning 
presented  by  these  authors.  The  rate  of  convergence  was  eval¬ 
uated  numerically  by  Hanin,  Zaider,  and  Yakovlev  (2001)  in 
terms  of  the  total  variation  distance  between  the  exact  distribu¬ 
tion  of  the  number  of  clonogens  and  its  Poisson  limit.  Another 
useful  limiting  distribution  arises  in  the  subcritical  case  where 
(i  =  s/a  <  1.  This  condition  means  that  the  total  cell  loss  due 
to  both  causes  of  cell  death  (radiation  induced  and  spontaneous) 
prevails  on  average  over  the  cell  gain  owing  to  the  proliferation 
of  tumor  cells  between  fractions  of  radiation  dose.  An  explicit 
expression  of  this  distribution  was  derived  by  letting  i  — >  oo 
and  n  — ►  oo  so  that  ipn  y,  0  <  y  <  oo  (Hanin  et  al.  2001). 

The  book  by  Yakovlev  and  Tsodikov  (1996)  presents  sev¬ 
eral  applications  of  (1.4)  with  a  two-  or  three-parameter  gamma 
distribution  for  the  function  F.  The  authors  use  the  Hjort  test 
(Hjort  1990)  for  goodness-of-fit  testing,  which  is  a  natural 
choice  in  the  parametric  analysis  of  censored  data  without  co¬ 
variates.  In  a  recent  article,  Gregori,  Hanin,  Luebeck,  Mool- 
gavkar,  and  Yakovlev  (2002)  adopted  the  Hjort  goodness-of-fit 
test  for  testing  several  models  of  carcinogenesis.  Methods  of 
model  diagnostics  specially  designed  for  different  versions  of 
this  model  have  yet  to  be  explored. 

In  this  article,  the  current  state  of  the  art  in  methodology  of 
statistical  inference  based  on  the  BCH  model  is  reviewed.  The 
approach  under  review  has  emerged  from  cancer  studies,  and 
this  fact  necessarily  reflects  on  the  focus  of  our  discussion.  The 
history  of  the  BCH  model  is  relatively  short,  and  the  model  has 
not  yet  enjoyed  applications  to  other  human  diseases,  to  say 
nothing  of  nonmedical  applications,  such  as  estimation  of  re¬ 
cidivism  rates.  However,  it  should  be  kept  in  mind  that  other 
possible  applications  of  the  BCH  model,  including  those  men¬ 
tioned  previously,  are  every  bit  as  well  justified  as  for  the  binary 
mixture  model;  there  is  no  reason  for  practitioners  to  refrain 
from  using  the  BCH  model  just  because  it  was  originally  de¬ 
rived  in  the  context  of  cancer  development. 

The  BCH  model  has  the  following  distinct  advantages: 

1 .  It  allows  construction  of  a  rich  class  of  nonlinear  transfor¬ 
mation  regression  models  to  describe  complex  covariate  effects. 
The  class  includes  the  traditional  proportional  hazards  model 
as  a  special  case;  this  structure  is  lacking  in  the  binary  mixture 
model.  This  makes  the  BCH  model  a  natural  tool  for  studying 
and  testing  departures  from  the  proportionality  of  risks. 
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2.  In  some  settings  the  BCH  model  provides  a  biologically 
meaningful  interpretation  of  the  results  of  the  data  analysis. 
This  feature  is  especially  important  for  combining  statistical 
inference  with  other  mathematical  approaches  to  biomedical 
problems,  for  example,  optimization  of  cancer  therapy. 

3.  The  BCH  model  offers  certain  technical  advantages  when 
developing  maximum  likelihood  or  Bayesian  estimation  proce¬ 
dures. 

Before  turning  to  the  discussion  of  specific  methods  and  re¬ 
sults,  one  important  remark  is  in  order  here.  As  is  evident  from 
definition  (1 .1),  the  probability  of  cure  is  essentially  an  asymp¬ 
totic  notion.  However,  the  period  of  observation  in  actual  truth 
is  always  finite,  which  is  to  say  that  one  deals  here  with  a  prob¬ 
lem  of  prediction  rather  than  with  the  usual  type  of  statistical 
inference.  In  other  words,  additional  assumptions  as  to  the  be¬ 
havior  of  G(t)  or  X(t)  beyond  the  period  of  observation  are 
needed.  At  the  same  time  a  sample  of  i.i.d.  right-censored  ob¬ 
servations  is  available,  and  it  is  natural  that  one  tries  to  reduce 
the  problem  of  prediction  to  that  of  estimation. 

2.  NONPARAMETRIC  METHODS  FOR 
A  HOMOGENEOUS  SAMPLE 

As  was  mentioned  in  Section  1,  formula  (1.2)  is  a  general  ex¬ 
pression  for  any  improper  survival  time  distribution.  Proceed¬ 
ing  from  this  representation,  Mailer  and  Zhou  (1996)  developed 
a  theory  of  nonparametric  estimation  of  the  probability  p.  Sup¬ 
pose  that  the  survival  time  distribution  G(t)  is  absolutely  con¬ 
tinuous  and  let  t\  <  t2  <  •  •  •  <  tn  be  a  sample  (subject  to  right 
censoring)  of  the  ordered  observed  failure  times.  Mailer  and 
Zhou  (1996)  suggested  estimating  p  by 

A.  =t, »(/„),  (2.1) 

where  Gn(t)  is  the  Kaplan-Meier  estimator  of  the  underlying 
survival  function/Then  a  natural  estimator  for  the  conditional 
survival  function  Go(0  is  given  by 

Gon  (0  —  (  Gn  (?)  Ph)/(1—  Pn)i  t>0,  Pn  <  1  ■  (2.2) 

For  any  c.d.f.  K(t)  define  its  right  extreme  r k  as  xk  =  inf{t  > 
0 :  K(t)  =  1}.  The  consistency  question  for  the  estimator  pn  is 
settled  in  the  following  result  by  Mailer  and  Zhou  (1992): 

Assume  that  censoring  is  independent  and  0  <  p  <  1.  Let 
C{t)  be  the  censoring  time  survival  function.  Let  x h  be  the 
right  extreme  of  H\t)  =  1  -  G(t)C(t)  and  suppose  that  the 
c.d.f  G(t)  =  1  -  G(t)  is  continuous  at^xn  in  case  x h  <  oo. 
Let  Go(t)  =  1  -  Go(0  and  C(0  =  1  —  C(t).  Then  the  estima¬ 
tor  pn  is  consistent  if  and  only  if 

rc0  <  tc-  (2.3) 

Under  the  same  conditions  the  authors  showed  that  the  con¬ 
dition  x c0  <  x c  is  necessary  and  sufficient  for  the  conver¬ 
gence  in  probability  of  supr>0  \Gon(0  -  Go(t)l  to  0  as  n  tends 
to  oo.  Assuming  the  inequality  (2.3)  and  some  additional  mild 
conditions,  Mailer  and  Zhou  also  proved  that,  when  p  <  1, 
Jntfn  —  p)  is  asymptotically  (as  n  — ►  oo)  normally  distrib¬ 
uted  with  mean  0.  Laska  and  Meisner  (1992)  showed  that  the 
estimator  Gn  ( tn )  is,  in  fact,  the  generalized  nonparametricmax- 
imum  likelihood  estimator  of  the  proportion  p  in  the  mixture 
model  (1.2). 


As  discussed  previously,  the  inequality  x g0  <  Tc  is  both  nec¬ 
essary  and  sufficient  for  consistency  of  the  Kaplan-Meier  esti¬ 
mator  of  the  cure  probability  p.  This  inequality  may  be  thought 
of  as  a  quantification  of  “sufficient  follow-up”  (Mailer  and 
Zhou  1992).  If  the  condition  is  not  true,  then  failures  may  occur 
after  the  maximum  follow-up  period  and  it  is  not  possible  to  de¬ 
termine  which  proportion  of  the  late  censored  data  has  actually 
been  cured.  Mailer  and  Zhou  (1994)  suggested  a  statistical  test 
based  on  the  length  tn  -  t*  of  the  interval  between  the  largest 
uncensored  failure  time  t*  and  the  largest  overall  failure  time  tn . 
Intuitively,  if  this  interval  is  large,  then  the  last  failure  has  oc¬ 
curred  well  before  the  last  censoring  event  so  there  has  been 
sufficient  follow-up.  Mailer  and  Zhou  showed  that,  under  ap¬ 
propriate  regularity  conditions,  the  estimator  an  =  (1  -  Nn/n)n 
is  an  approximate  p  value  for  a  test  of  r c0  <  x c,  and  an  con¬ 
verges  to  0  in  probability  if  and  only  if  xq0  <  xc .  Here  Nn  is  the 
number  of  uncensored  failure  times  in  the  interval  (2 1*  —  tn,  t*]. 
This  test  estimates  a  significance  level  rather  than  controls  for  it, 
so  that  the  original  version  of  the  test  by  Mailer  and  Zhou  sug¬ 
gests  the  hypothesis  of  sufficient  follow-up  to  be  rejected  when¬ 
ever  the  estimated  p  value  exceeds  a  prespecified  critical  value, 
say  .05.  Later,  after  conducting  computer  simulations.  Mailer 
and  Zhou  (1996)  came  to  the  conclusion  that  this  test  was  far 
too  conservative.  It  is  clear  that  a  pertinent  test  should  be  based 
on  percentiles  of  the  sample  distribution  of  an  (or  the  closely  re¬ 
lated  statistic  qn  =  Nn /n)  rather  than  on  a  fixed  p  value.  Unfor¬ 
tunately,  the  relevant  sample  distributions  are  not  known  even 
in  large  samples.  The  idea  of  sufficient  follow-up  is  intuitively 
compelling  and  a  search  for  a  more  general  formal  definition  of 
this  notion  (which  is  not  necessarily  related  to  consistency  of 
the  corresponding  nonparametric  estimator)  should  be  contin¬ 
ued. 

When  proceeding  from  the  BCH  model,  formula  (1.1)  sug¬ 
gests  the  following  nonparametric  estimator  of  the  cure  proba¬ 
bility:  pn  —  exp{— An(f„)},  where  A n(tn)  is  the  Nelson-Aalen 
estimator  of  the  cumulative  hazard  at  the  point  of  last  observa¬ 
tion.  A  pertinent  nonparametric  estimator  for  F(t)  can  be  pro¬ 
posed  in  the  form: 

Fn(t)  =  \ogdn(t)/\ogdn(tn)i  (2.4) 

where  Gn  (f)  is  the  Kaplan-Meier  estimator  for  the  survival 
function  G(t). 

It  is  important  to  note  that  the  preceding  nonparametric  es¬ 
timators  yield  an  estimate  of  survival  probability  at  the  right 
end  T  of  the  observation  period.  This  estimate  is  all  we  have  to 
predict  the  behavior  of  the  survival  function  beyond  the  period 
of  observation,  and  all  such  predictions  are  final.  If  we  strictly 
follow  the  formal  definition  of  cure  rate,  we  have  to  recognize 
that  the  nonparametric  approach  implies  a  straight-line  extrap¬ 
olation  of  the  estimated  survival  function  beyond  the  period  of 
observation  T :  G(t)  is  set  to  be  equal  to  a  constant  value  of 
Gn{tn)  for  all  t  >  T.  The  same  holds  true  for  the  semiparamet- 
ric  regression  models  discussed  in  Section  3.  It  should  also  be 
noted  that  the  length  T  of  the  period  of  observation  is  implic¬ 
itly  involved  in  (2.3).  Let  us  decompose  the  censoring  time  sur¬ 
vival  function  as  follows:  C(r)  =  C*(f)[  1  —  I  (t  —  71)],  where 
I(x)  =  0  for  x  <  0  and  I(x)  =  1  for  x  >  0.  Then  inequal¬ 
ity  (2.3)  is  replaced  by  x g0  <  minftc*,  T}.  The  finiteness  of 
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the  follow-up  period  is  relevant  to  large-sample  studies,  where 
the  value  of  T  should  be  kept  fixed  when  increasing  the  sample 
size;  otherwise,  ensuring  asymptotic  properties  would  imply  in¬ 
finite  observation  time. 

Another  feature  inherent  in  the  nonparametric  approach  has 
to  do  with  the  instability  (high  variance)  of  the  nonparametric 
estimators.  It  is  a  well-known  fact  that  the  Kaplan-Meier  es¬ 
timator  becomes  highly  unstable  for  t  close  to  the  end  of  an 
observation  period  in  the  presence  of  heavy  censoring  (Pepe 
and  Fleming  1989;  Cantor  and  Shuster  1992;  Tsodikov  2001), 
which  may  have  a  very  significant  effect  on  the  accuracy  of  cure 
rate  estimation .  The  method  for  testing  for  sufficient  follow-up, 
proposed  by  Mailer  and  Zhou  (1994,  1996),  suffers  from  the 
same  kind  of  instability  as  well.  The  nonparametric  estimator 
Fn(t)  =  \ogGn(t)/\ogGn(tn)  is  particularly  sensitive  to  varia¬ 
tions  of  Gn(tn)  in  the  denominator.  Therefore,  it  is  not  recom¬ 
mended  to  use  it  in  its  original  form.  An  improved  estimator 
was  proposed  in  Tsodikov  (2001).  This  is  a  two-stage  estima¬ 
tor.  First,  Gn(tn)  is  estimated  parametrically.  Then  G(t)  is  es¬ 
timated  nonparametrically  with  Gn(tn)  constrained  to  be  equal 
to  the  corresponding  value  of  the  parametric  estimate. 

3.  SEMIPARAMETRIC  REGRESSION  SURVIVAL 
MODELS  WITH  CURE 

3.1  Mixture  Models  and  Generalizations 

So  far  we  have  followed  two  distinct  lines  of  reasoning  in  the 
discussion  of  cure  models:  statistical  and  mechanistic.  We  now 
consider  more  formal  relationships  between  the  two  aspects  of 
the  problem. 

A  cure  model  can  be  formulated  by  making  assumptions 
about  either  the  hazard  or  the  survival  function.  For  example, 
making  the  assumption  on  the  bounded  cumulative  hazard  in  a 
proportional  hazards  (PH)  model,  we  obtain  the  so-called  im¬ 
proper  PH  model 

G(f|/?,z)  =  expM(0,z)F(O},  (3.1) 

where  P  is  a  vector  of  regression  coefficients,  z  is  a  vector  of 
co variates,  and  0(P,  z)  is  a  known  function  relating  p  to  z.  In 
what  follows  we  allow  for  more  than  one  predictor,  and  for  an 
arbitrary  parameterization  of  regression  predictors.  However,  in 
the  examples  the  most  common  exponential  parameterization  is 
used,  in  which  0(j8,  z)  =  exp {jS'z},  where  P'  is  the  transposed 
P  vector.  The  component  po  of  the  vector  p  =  (Po,  P\, ...)'  is 
the  intercept  term;  therefore,  zo  =  1  in  the  vector  of  covariates 
Z=(Z0'Zlf--)' 

A  general  class  of  semiparametric  regression  models,  nonlin¬ 
ear  transformation  models  (NTM),  was  proposed  in  Tsodikov 
(2002, 2003): 

G(t\z)  =  y(F(t)\p,z),  (3.2) 

where  y(x\p,z)  is  some  parametrically  specified  cumulative 
distribution  function  in  x  with  support  on  [0,1].  Although  our 
discussion  allows  for  any  parameterization  of  y  in  terms  of  P 
and  z,  in  the  examples  we  assume  that  y  is  parameterized 
through  a  set  of  parameters/predictors  0,  where  each 

predictor  is  further  parameterized  using  generally  different  sets 
of  regression  coefficients  P\,P2>  •••>  so  that  0  =exp(/3jz}, 


r]  =  expf/^z).  The  linear  transformation  models  considered  by 
Cheng,  Wei  and  Ying  (1995)  represent  a  subclass  of  NTM  with 

y(x\p,  z)  =  p[\og0(P,z)  -F q(x)]y  (3.3) 

where  p  is  a  tail  function  (=1-  c.d.f.)  and  q  is  an  inverse  of 
a  tail  function  (not  necessarily  that  of  p).  Cure  models  repre¬ 
sented  by  the  NTM  class  were  introduced  in  Tsodikov  (2002). 
To  make  (3.2)  a  cure  model,  the  following  assumptions  are 
made  to  enforce  the  limit  G(t\P,  z)  p  >  0: 

y(0\P,  z)  =  h(P,z)  >  0,  lim  F(f)=0.  (3.4) 

t—>  oo 

The  restriction  lim,_+oo  F(t)  =  0  proposed  by  Taylor  (1995) 
in  the  context  of  the  two-component  mixture  model  removes 
the  overparameterization  of  the  description  of  the  baseline  cure 
rate  through  F  and  h  and  the  associated  estimation  and  conver¬ 
gence  problems  forthe  fitting  algorithms.  Following  the  restric¬ 
tion,  the  estimate  F  is  assumed  to  satisfy  F(last  failure)  =  0. 
This  restriction  is  also  necessary  for  separation  of  the  long-  and 
short-term  covariate  effects  on  survival.  The  preceding  restric¬ 
tion  should  be  observed  when  parameterizing  the  model,  as  dis¬ 
cussed  in  the  following  examples. 

Alternatively,  a  cure  model  can  be  formulated  as  a  two-stage 
model.  First,  an  unobservable  random  variable  M  is  postulated 
with  distribution  p(m\p,  z)  that  depends  on  covariates.  Second, 
the  observed  survival  function  is  formed  as 

G(t\p,z)=E[F(t)M\z],  (3.5) 

where  the  expectation  is  taken  with  respect  to  M,  given  z,  and 
the  distribution  of  M  depends  on  P  and  z.  The  model  (3.5)  rep¬ 
resents  a  generalized  PH  frailty  model,  which  we  simply  call 
the  PH  mixture  model,  introduced  in  Tsodikov  (2002,  2003), 
where  M  is  assumed  to  be  an  arbitrary  nonnegative  random 
variable.  In  this  context  M  can  be  interpreted  as  a  missing  co¬ 
variate  in  a  PH  model.  Obviously,  the  usual  PH  frailty  model 

G(t\p,z)  =  E[mue^z)\z],  (3.6) 

where  the  distribution  of  U  is  independent  of  covariates,  is 
a  particular  case  of  (3.5)  with  M  =  U6(P,  z).  Missing  vari¬ 
ables  U  dependent  on  covariates  have  been  considered  by  sev¬ 
eral  authors  (e.g.,  Wassel  and  Moeschberger  1993)  within  the 
shared  PH  frailty  framework.  It  should  be  noted  that  the  PH 
mixture  model  (3.5),  although  at  least  as  general  as  (3.6)  with 
U  =  U(P,  z),  leads  to  a  computationally  efficient  estimation 
procedure  that  we  consider  in  the  next  section.  Because  all 
models  considered  previously  in  this  article  are  particular  cases 
of  (3.5),  an  estimation  procedure  designed  for  the  PH  mixture 
model,  or  for  the  NTM  subclass  (3.2)  restricted  to  (3.4),  repre¬ 
sents  a  universal  tool  for  statistical  inference  with  such  models. 

The  discrete  mixture  model  [i.e.,  (3.5)  with  a  discrete  random 
variable  M],  which  was  used  in  particular  forms  as  a  mechanis¬ 
tic  model  in  Section  1 ,  can  be  linked  to  the  NTM  class.  Observe 
that  the  p.g.f.  of  a  discrete  random  variable  M 

00 

<pm(x) = Pmx>n 

m=  0 

is  a  distribution  function  in  x  with  the  support  [0,  1  ] .  Indeed,  be¬ 
cause  the  pm  are  nonnegative,  <pm  (*)  is  increasing  in  x.  Also, 
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<Pm(1)  =  =  1.  If  Af  is  discrete  and  depends  on  co¬ 

variates,  (3.5)  can  be  rewritten  as  an  NTM  with 

Y(x\z)  =  (Pm(x\z)> 

With  the  discrete  mixture  model,  the  probability  of  cure  is  given 
by  y(0\z)  -  <pm(0\P,  z)  =  Po(P>  z). 

Alternatively,  we  may  want  to  write  an  NTM  model  as  a  dis¬ 
crete  mixture  model.  It  should  be  noted  that  NTM  is  a  wider 
class  than  that  of  mixture  models.  For  an  NTM  to  be  a  discrete 
mixture,  we  may  require  that  y (x\fi,  z)  be  an  analytic  function 
of  x  on  [0, 1],  As  such,  it  can  be  expanded  in  a  power  series 
about  x  =  0.  Additionally,  the  coefficients  of  the  corresponding 
power  series  must  be  positive.  Should  this  be  the  case,  they  can 
be  interpreted  as  probabilities,  and  y  as  a  generating  function 
of  some  discrete  random  variable.  If  y  is  a  mixture  model  (dis¬ 
crete  or  continuous),  then  =  y(e~k)  is  the  Laplace  trans¬ 
form  of  the  distribution  of  the  mixing  variable  M.  Generally, 
a  necessary  and  sufficient  condition  for  y(x|*)  to  be  a  mixture 
model  is  that  be  a  completely  monotonic  function  as  given 
by  the  Bernstein  theorem  (see  Feller  1971).  A  function  is 
called  completely  monotonic  if  all  of  the  derivatives  exist 
and(-l)V('°W>0,A>0. 

We  can  now  represent  the  cure  models  discussed  earlier  in 
this  article  as  members  of  the  mixture-NTM  model  family.  It 
should  be  stressed  that  such  a  representation  is  not  unique. 

The  improper  PH  model  (3.1)  corresponds  to 

y(x\fi9z)  =exp{-0QJ,z)(l  -*)}.  (3.7) 

This  is  the  generating  function  of  the  Poisson  distribution,  and 
by  expanding  y  about  x  —  0,  we  get  a  power  series  with  Pois¬ 
son  probabilities  as  coefficients.  This  gives  us  the  mechanistic 
interpretation  of  the  improper  PH  model  discussed  in  Section  1 . 

Consider  an  extension  of  the  improper  PH  model  allowing 
for  dissimilar  covariate  effects  on  long- and  short-term  survival. 
To  construct  an  extended  hazard  model  (the  term  was  intro¬ 
duced  by  Etezadi-Amoli  and  Ciampi  1987),  we  employ  the  fact 
that  F  is  a  survival  function.  Incorporating  covariates  into  F, 
we  can  add  a  short-term  effect  to  the  improper  PH  model.  The 
class  of  extended  hazard  models 

G(/|/?,z)  =  exp{-0(/3„z)[l-y(F(f)liS2.z)]}.  (3.8) 

where  y  is  an  NTM  and  P  =  (>3 x ,  >52),  Pi  =  (PiO,  Pi\>  ••  *),  i  = 
1,  2,  was  introducedin  Tsodikov(2002).If  y  is  a  mixture  model 
itself,  then  (3.8)  is  also  a  mixture  model  with  y  =  exp{— 9[\  — 
y)}.  The  mixing  variable  M  that  generates  the  family  (3.8)  has 
a  well-defined  structure  of  a  compound  variable 

V 

M  =  (3.9) 

k=  1 

where  v  is  a  Poisson  random  variable,  =  0,  and  the  8k  are 
i.i.d.  copies  of  a  random  variable  8  with  Laplace  transform 
y(e~k).  The  binary  distribution  for  v  gives  rise  to  the  two- 
component  mixture  class  of  models 

G(t\p,  z)  =  p{fix ,  z)  +  p{fix ,  z)y[ T(t)\fi2,  z], 

P=l-P-  (3.10) 

Kuk  and  Chen  (1992)  proposed  a  regression  model  in  the  form 
of  (3.10)  with  p  being  a  logistic  regression  and  y  a  propor¬ 
tional  hazards  regression.  This  model  was  further  studied  by 
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Sy  and  Taylor  (2000)  and  Peng  and  Dear  (2000).  The  Poisson 
distribution  for  v  leads  to  the  bounded  hazard  family  of  mix¬ 
ture  models.  For  example,  let  8  be  degenerate  (nonrandom): 
Pr(<5  —  ti(P 2,z))  =  1.  Then  M  =  r\(p2,z)v,  where  v  is  Pois¬ 
son  with  expectation  0  (ft  i ,  z) .  Then  we  have  the  PHPH  model 

G(t\fi,z)  =  exp[— 0(/S„ z){  1  -  F(t)^2.«}].  (3.1 1) 

The  model  (3.11)  was  proposed  by  Broet  et  al.  (2001)  in  the 
context  of  two  sample  score  tests  for  long-  and  short-term  co¬ 
variate  effects. 

In  the  preceding  cure  models,  restriction  (3.4)  needs  to 
be  observed.  In  models  (3.8)  considered  so  far,  one  predic¬ 
tor  (/?)  was  used  to  describe  the  short-term  effect.  To  avoid 
overparameterization  of  the  model,  the  intercept  component 
of  $2  has  to  be  fixed  or  removed,  for  example,  by  setting 
P20  =  0.  With  this  coding  of  the  model,  the  intercept  term 
in  the  first  predictor  (0),  /?io,  corresponds  to  the  baseline  log 
cure  rate.  With  an  exponential  parameterization  of  the  predic¬ 
tors,  the  baseline  survival  function  takes  the  form  Gb(t  |/J0,  z)  = 
exp{-  exp(/?io)[l  -  k(F|0)]},  where  0  =  (0, 0, . . .)'  and  P0  = 
((PlOy  0, 0, . . .),  0)'.  As  in  the  Cox  model  (Cox  1972),  the  re¬ 
gression  coefficients  pi j ,  i  =  1,2,  j  —  1,2,...,  correspond 
to  the  relative  effects  of  the  covariates  uj  on  the  long-  and 
short-term  predictor,  i  —  1,2,  respectively.  For  example,  with 
the  PHPH  model,  Pij,  i  =  1,2,  j  =  1,2,...,  is  the  log  of  the 
hazard  ratio  in  the  two  PH  models  of  which  the  PHPH  model 
is  composed,  corresponding  to  the  long-  and  short-term  effect, 
i  =  1,2,  respectively. 

3.2  Estimation  Procedures 

The  issue  of  the  potentially  infinite  dimension  has  been 
the  most  critical  deterrent  to  the  use  of  maximum  likelihood 
estimation  (MLE)  in  semiparametric  regression  models.  Meth¬ 
ods  based  on  the  partial  likelihood  are  specific  to  the  propor¬ 
tional  hazards  model  and  do  not  extend  to  other  models.  The 
Newton-Raphson  procedure  requires  taking  the  inverse  of  the 
information  matrix,  which  gets  computationally  prohibitiveand 
unstable  with  increasing  dimension.  Characterizing  the  future 
directions  in  survival  analysis  in  their  editorial  in  Biometrics , 
Fleming  and  Lin  (2000)  pointed  out  that  “it  would  be  highly 
useful  to  develop  efficient  and  reliable  numerical  algorithms  for 
the  semiparametric  estimation. . . .”  In  the  following  sections  we 
review  some  recent  approaches  to  the  problem. 

Introduce  a  set  of  times  ti ,  i  =  1 , . . . ,  n,  arranged  in  increas¬ 
ing  order,  with  tn. j_j  =  oo.  Associated  with  each  ti  is  a  set  of 
individuals  Dt*  with  covariates  Zij ,  j  e  Vi,  who  fail  at  U,  and 
a  similar  set  of  individuals  C/  with  covariates  z ,-j,  j  e  Ci ,  who 
are  censored  at  ti.  For  any  function  A(t)  let  A,*  =  A(ti)  and 
A  Ai  =  |  A/  —  A ,  _  1 1 .  The  generalized  log-likelihood  for  an  NT 
model  (3.2)  can  be  written  as 


+  £>gy(F<|zl7)  .  (3.12) 

JeC,  1 
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3.2.1  ConventionalMaximization  Methods.  Consider  max¬ 
imization  of  the  log-likelihood  function  £(x)  with  respect  to 
the  joint  vector  of  parameters  x,  representing  regression  coef¬ 
ficients  P  and  the  nonparametrically  specified  function  F.  The 
vector  AH  represents  a  set  of  jumps  A  Hi  of  the  cumulative 
hazard  function  //,  which  can  be  used  to  parameterize  F  sov 
that  x  =  (0,  AH).  Consider  a  so-called  direction  sets  method 
(Press,  Flannery,  Teukolsky,  and  Vetterling  1994)  constructed 
as  follows: 

1 .  Given  the  current  iteration  vector  x(*\  find  the  search  di¬ 
rection  s^k\ 

2.  Maximize  /(x ^  +  ys^)  with  respect  to  the  scalar  y. 

3.  Set  x(k+])  =  x(^  +  y*s^k\  where  y *  is  the  solution  at  the 
previous  step. 

4.  Set  k  =  k+  1  and  continue  to  step  1 . 

The  Newton-Raphson  method  is  one  example  of  the  direc¬ 
tion  sets  methods.  To  avoid  taking  the  inverse  of  the  full-model 
information  matrix,  the  search  direction  s  can  be  specified  ac¬ 
cording  to  the  Powell  method,  which  uses  multiple  line  max¬ 
imizations  in  one  dimension  to  construct  a  set  of  conjugate 
directions  (Press  et  al.,1994). 

3.2.2  Profile  Likelihood  Approach.  Assume  that  we  have  a 
method  to  obtain  the  global  maximum  of  i  with  respect  to  F, 
given  fi.  We  discuss  two  such  methods  in  the  next  two  sections. 
We  may  write  the  profile  log-likelihood  as  an  implicit  function 
of  p: 

=  (3.13) 

where  F*  is  the  solution  of  the  problem 

tpT(P)  =  max  l%  (3.14) 

where  T*  is  the  class  of  proper  discrete  survival  functions. 
A  profile  algorithm  is  a  straightforward  nested  procedure: 

•  Maximize  £pr(P)  by  a  conventional  nonlinear  program¬ 
ming  method  (e.g.,  a  direction  sets  method). 

•  For  any  P  as  demanded  in  the  preceding  maximization 
procedure,  solve  the  problem  (3.14)  with  specified  toler¬ 
ance. 

Inference  based  on  the  profile  likelihood  is  similar  to  that 
based  on  the  partial  likelihood  for  the  PH  model.  However,  the 
classical  theory  of  MLE  does  not  apply  to  infinite  dimensions. 
Important  results  have  been  obtained  regarding  theoretical  justi¬ 
fications  for  the  nonparametric  maximum  likelihood  estimation 
(NPMLE)  method  and  the  profile  likelihood  for  semiparametric 
models  (Murphy  2000).  It  was  shown  that  profile  likelihoods 
with  nuisance  parameters  estimated  out  behave  like  ordinary 
likelihoods  under  some  conditions.  In  particular,  these  results 
apply  to  the  PH  model,  the  proportional  odds  (PO)  model,  the 
PH  frailty  model,  and  presumably  to  many  other  models. 

3.2.3  Restricted  Nonparametric  Maximum  Likelihood  Esti¬ 
mation.  Tsodikov  (2002)  developed  the  so-called  restricted 
nonparametric  maximum  likelihood  estimation  (RNPMLE) 
algorithm.  The  RNPMLE  method  is  based  on  a  recurrent  struc¬ 
ture  of  the  score  equations  with  respect  to  the  nonparamet¬ 
ric  part  of  the  model  (F).  Earlier  variations  of  the  proce¬ 
dure  were  developed  for  the  PH  model  (Tsodikov  1998a)  and 


for  the  proportional  odds  (PO)  model  (Tsodikov,  Hasenclever, 
and  Loeffler  1998a)  and  were  subsequently  generalized  for 
NT  cure  models  (Tsodikov  2002).  Although  quite  computa¬ 
tionally  intensive,  the  RNPMLE  method  is  numerically  stable 
and  it  allows  us  to  avoid  taking  inverses  or  approximating  the 
information  matrix  of  the  full  model. 

According  to  the  nonparametric  maximum  likelihood 
method,  the  model  can  be  fitted  by  maximizing  the  general¬ 
ized  log-likelihood  (3.12)  with  respect  to  the  regression  pa¬ 
rameters  p  and  the  unspecified  survival  function  F  (or  the 
corresponding  distribution  function  F).  In  doing  so,  the  log- 
likelihood  i  is  maximized  for  a  step  function  F  with  steps  at 
the  times  of  failures. 

As  discussed  earlier  we  impose  the  restriction  F(tn)  =  1.  To 
maximize  (3.12)  under  the  restriction  F(tn )  =  1,  we  use  the 
method  of  Lagrange  multipliers  and  obtain  the  system  of  score 
equations 


aid,  f) 
dp 


(3.15) 


Bi(P,  F) 
dFi 


/  =  (3.16) 


F„  =  l.  (3.17) 

Inference  procedures  for  the  regression  coefficients  P  can 
be  obtained  from  the  profile  log-likelihood  (3.14).  The  profile 
log-likelihood  is  obtained  by  solving  the  score  equations  (3.16) 
and  (3.17)  for  F  simultaneously, given  p. 

In  the  RNPMLE  method  the  system  of  score  equations  (3.16) 
is  partitioned  into  a  subsystem  of  the  form 

|i=^(Fl-i,Fl-,F/+1)=0  (3.18) 

dFi 

that  can  be  solved  recurrently  for  F2, . . . ,  F„,  given  Fo  =  0  and 
F\.  Finally,  the  equation  <p(F\)  =  1  is  solved,  where  (p  is  de¬ 
fined  as  F„  generated  recurrently  using  the  subsystem  (3.18). 
The  RNPMLE  method  yields  point  estimates  and  confidence 
intervals  for  the  regression  coefficients  and  the  function  F 
(Tsodikov  2002). 

3.2.4  EM-Based  Methods .  The  EM  algorithm  used  to  fit 
shared  frailty  models  in  survival  analysis  (Nielsen,  Gill,  Ander¬ 
sen,  and  S0rensen  1992)  handles  H  in  a  numerically  efficient 
way.  This  is  made  possible  as  the  M  step  reduces  to  the  PH 
model.  Estimates  are  obtained  iteratively  by  maximizing  the 
partial  likelihood  and  computing  the  Nelson-Aalen-Breslow 
estimator  (Andersen,  Borgan,  Gill,  and  Keiding  1993)  for  the 
cumulative  hazard  H.  Similar  algorithms  have  been  used  to  fit 
the  two-component  mixture  model,  given  by  Taylor  (1995),  Sy 
and  Taylor  (2000, 2001 ),  and  Peng  and  Dear  (2000). 

Recently,  Tsodikov  (2003)  generalized  the  EM  algorithm  for 
the  frailty  model  into  a  universal  “distribution-free”  procedure 
applicable  to  the  general  PH  mixture  model  (3.5)  and  NTM 
class  (3.2).  This  family  of  algorithms  (quasi-EM,  QEM)  is  a 
subclass  of  the  so-called  MM  algorithms  based  on  surrogate 
objective  functions  (Lange,  Hunter,  and  Yang  2000).  Broadly 
defined,  an  MM  algorithm  substitutes  a  computationally  sim¬ 
pler  surrogate  objective  function  for  the  target  function  on  each 
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step  of  the  procedure  (similar  to  the  E  step  of  the  EM).  Maxi¬ 
mizing  the  surrogate  objective  function  drives  the  target  func¬ 
tion  in  the  desired  direction.  Thus,  a  difficult  maximization 
problem  is  replaced  by  a  series  of  simpler  ones.  Unfortunately, 
there  is  no  universal  recipe  on  how  to  find  an  appropriate  sur¬ 
rogate  objective  function.  The  idea  of  the  QEM  is  to  obtain  a 
surrogate  objective  function  by  generalizing  the  one  inherent 
in  the  EM  procedure  so  that  it  does  not  depend  on  the  miss¬ 
ing  data  formulation  of  the  model.  Practically,  the  QEM  ap¬ 
proach  for  semiparametric  models  is  based  on  a  derivation  of 
the  “distribution-free” E  step  for  the  EM  algorithm, constructed 
for  the  PH  mixture  model  (3.5).  More  formally,  we  consider  the 
QEM  procedure  for  cure  models.  The  following  statement  is  the 
key  result  underlying  the  QEM  construction. 


where  Ilk  =  U"=j u  A 1 is  the  risk  set  at  f*,  D*  is  the  num¬ 
ber  of  failures  at  fy,  and 


G(Fi\-u) 


Y'iFilij)- 

y(Fi\-ij)  " 

1  + 

y'iFA-ij) 

_Y'(FrK_,Vij)  j 

rUV.I-yJ-rCOI-y) 

o, 


'•*-1  > 


j  zCi,  i  <rK  —  1, 
j  S  Vi,  i<rK- 1, 

j  e  VrK ,  i  =  rK, 

j  e  Ci ,  i  >rK, 

(3.21) 


Proposition  3.1  (Tsodikov  2003).  Let  t  be  the  observed 
event  time  under  independent  censoring  and  let  c  be  the  ob¬ 
served  censoring  indicator  (c  =  1 ,  if  a  failure,  and  c  =  0,  other¬ 
wise).  Under  the  PH  mixture  model, 

G(t|)  =  y[F(f)|]  =  E{[F(Or()}. 

and  if  F\x)  >  0,  the  conditional  expectation  of  M,  given  the 
observed  event,  is  given  by 

E{M(.)|.,r,c}  =  0[F(r)|.,c], 

where 

©[x|-,  c)  =  c  +  x  (3.19) 

y{c)(x  10 


where  (•)//  stands  for  (p,  z */).  The  fact  that  the  contributions 
of  the  observations  i  j  to  the  likelihood  as  well  as  0  at,  or  af¬ 
ter  the  last  failure  (i  >  r^),  are  different  from  their  counter¬ 
parts  before  the  last  failure  (/  <  —  1)  is  due  to  the  restriction 

Fi  =  0,  i  =  rn , . . . ,  n,  representing  the  fact  that  F  is  a  proper 
survival  function.  This  distinction  is  not  made  if  the  model  is  a 
general  NTM  with  an  unrestricted  F . 

It  is  interesting  to  note  that  the  score  equations  have  the  form 
of  the  Nelson-Aalen-Breslow  estimator  for  the  PH  model  with 
the  usual  predictor  replaced  by  0.  Because  0  depends  on  F , 
an  iteration  procedure  is  needed  to  satisfy  (3.20).  Given  P,  iter¬ 
ations  with  respect  to  F  can  be  carried  out  as  follows: 

•  At  the  kth  iteration ~F*k\  compute  0^  for  each  subject. 

•  Update  F ^  by  using  the  Nelson-Aalen-Breslow 

estimator  (3.20). 


where  y^c\x |-)  =  dcy(x\-)/dxc,  c  =  0,1,...,  y(0)(*|9  = 
y(x\ •). 

The  preceding  result  indicates  that  the  E  step  can  be  con¬ 
structed  using  the  first  two  derivatives  of  the  NTM-generating 
function  y  without  any  knowledge  or  even  existence  of  the 
mixing  random  variable  M.  Specification  of  an  algorithm  for 
a  particular  model  requires  evaluation  of  0  for  the  model.  For 
example,  for  the  models  discussed  in  the  previous  section,  we 
have 


•  Improper  PH  model:  From  (3.1)  or  (3.7),  0(x|-,c)  = 
c  +  0(-)*. 

•  PHPH  model:  From  (3.11),  ©(*|-,c)  =  0(-)»?(-)x’,(')  + 
ctj(-). 

Cure  models  require  a  correction  of  ©  at  the  last  observa¬ 
tion  [see  (3.21)].  Let  C(j  =  1,  j  6  £>/,  and  Cij  =  0,  j  €  Ci. 
Let  Tfc,  k  =  1, ...» K,  be  the  time  points  in  ascending  order 
where  failures  occur  ( V  is  not  empty).  Denote  by  r*  the  rank 
of  Xk  in  the  set  {r,  }.  The  ranks  r*  locate  the  points  r k  on  the  set 
fa  :  x k  =  trfc}.  By  definition,  we  set  r^+i  =  n  4*  1  and  to  =  0. 
Imputation  of  the  mixing  variable  M  using  (3.19)  results  in 
the  score  equations  for  the  cumulative  hazard  H  corresponding 
to  F , 


A  Hk  = 


Dk : 

E0-6^©(F,|/3,z0)’ 


(3.20) 


It  can  be  shown  (Tsodikov  2003)  that  if  0(x  |  ■)  is  a  nondecreas¬ 
ing  function  of  x,  or  if  y  is  a  PH  mixture  model  (a  stronger 
assumption),  then  each  iteration  described  previously  improves 
the  likelihood.  Also,  this  assumption  makes  the  preceding  pro¬ 
cedure  a  member  of  the  MM  family  (Lange  et  al.  2000),  and  the 
convergence  properties  of  the  algorithm  follow  from  the  general 
MM  theory. 

There  are  many  ways  to  build  a  particular  model  fitting  al¬ 
gorithm  based  on  the  principles  described  previously,  and  this 
depends  on  how  the  maximization  with  respect  to  P  is  incor¬ 
porated  into  the  procedure.  One  way  would  be  to  use  partial 
likelihood  and  a  setup  similar  to  the  EM  algorithm  for  frailty 
models  (Nielsen  et  al.  1992).  Such  an  EM  algorithm  was  used 
in  Sy  and  Taylor  (2000)  and  Peng  and  Dear  (2000)  to  fit  the 
two-component  mixture  model.  As  simple  as  the  setup  of  the 
procedure  with  a  binary  distribution  for  M  might  be,  its  gen¬ 
eralization  for  a  wider  family  of  PH  mixture  models  or  NTM’s 
is  problematic.  The  key  to  a  straightforward  setup  of  the  pro¬ 
cedure  is  not  to  use  the  EM  principle  on  P,  as  it  would  require 
working  with  the  distribution  of  M . 

We  find  the  profile  version  of  the  QEM  algorithm  particu¬ 
larly  easy  and  straightforward  to  work  with.  The  profile  QEM 
algorithm  outperformed  conventional  maximization  procedures 
(Sec.  3.2.1)  and  the  frailty  EM  algorithm  based  on  partial  likeli¬ 
hood  when  used  to  fit  a  semiparametric  proportional  odds  (PO) 
model.  Also,  it  was  shown  to  be  faster  than  a  directions  set 
method  used  to  fit  a  parametric  PO  model. 
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3.3  Example:  Prostate  Cancer  Data 


The  study  was  carried  out  using  data  on  1,100  patients 
with  clinically  localized  prostate  cancer  who  were  treated  with 
three-dimensional  conformal  radiation  therapy  at  the  Memo¬ 
rial  Sloan-Kettering  Cancer  Center  (Zaider  et  al.  2001).  The 
patients  were  stratified  by  radiation  dose  (group  1,  <67.5  Gy; 
group  2,  67.5-72.5  Gy;  group  3,  72.5-77.5  Gy;  group  4,  77.5- 
87.5  Gy)  and  prognosis  category  [favorable,  intermediate,  and 
unfavorable  as  defined  by  pretreatment  prostate  specific  antigen 
(PSA)  measurement  and  Gleason  prognostic  score].  A  relapse 
was  recorded  when  tumor  recurrence  was  diagnosed  or  when 
three  successive  PSA  elevations  were  observed  from  a  post¬ 
treatment  nadir  PSA  level.  PSA  relapse-free  survival  was  used 
as  the  primary  endpoint.  There  were  no  failures  observed  in  pa¬ 
tients  with  favorable  prognosis  in  dose  group  4.  For  that  reason 
this  group  of  patients  is  not  included  in  the  data  analysis. 

We  applied  the  PHPH  extended  hazard  model  (3.1 1)  with  z 
specified  by  indicator  dummy  variables  that  code  the  four  dose 
groups  and  the  three  prognostic  categories.  In  the  unrestricted 
model  dose  and  summary  prognosis  may  each  have  an  effect 
on  long-term  survival  through  0(z)  and  on  short-term  survival 
through  t](z).  The  hypothesis  of  no  short-term  effect  of  prog¬ 
nostic  category  (proportional  hazards  with  respect  to  prognosis) 
is  not  rejected  (p  =  .49)  by  the  likelihood  ratio  test.  This  obser¬ 
vation  justifies  the  use  of  the  PH  model  for  the  effect  of  the 
prognostic  category.  However,  the  total  radiation  dose  appears 
to  have  a  significant  ( p  <  .0001)  short-term  effect  on  survival 
of  patients  with  prostate  cancer,  indicating  nonproportionality 
of  the  effect  of  dose.  Resorting  to  our  mechanistic  interpreta¬ 
tion  of  (3.1 1),  we  may  speculate  that  the  progression  time  dis¬ 
tribution  varies  with  radiation  dose  while  being  essentially  the 
same  for  all  prognostic  categories.  Semi  parametric  estimates  of 
the  mean  values  for  the  distributions  1  —  y(F\z)  are  equal  to 
1,279,  867,  803,  and  498  days  for  dose  groups  1,  2,  3,  and  4, 
respectively.  Estimates  of  the  hazard  ratio  for  the  short-term  ef¬ 
fect  of  each  dose  group, 


log  k(F|  dose  group/) 
log  y  ( F  | dose  group  4) 


B  (dose  group  /), 


show  a  monotonically  increasing  (with  dose)  short-term  risk: 
.188  with  confidence  limits  (.038,  1.538)  for  dose  group  1, 
.415  (.065,  .765)  for  dose  group  2,  .474  (.024,  1.224)  for  dose 
group  3,  and  1 .000  for  dose  group  4. 

Semiparametric  estimates  of  the  probability  of  cure  are  given 
in  Table  1.  For  comparison,  we  also  present  parametric  esti¬ 
mates  obtained  with  F  specified  by  a  two-parameter  gamma 
distribution.lt  is  shown  in  Table  1  that  the  two  estimates  appear 
to  be  quite  close  to  each  other.  This  analysis  indicates  that,  in 
terms  of  cure  rate,  dose  escalation  has  a  significant  positive  ef¬ 
fect  only  in  the  intermediate  and  unfavorable  groups.  It  is  also 
found  that  progression  time  is  inversely  proportional  to  dose, 
which  means  that  patients  recurring  in  higher  dose  groups  have 
shorter  recurrence  times,  yet  these  groups  have  better  long-term 
survival.  One  possible  explanation  for  this  seemingly  illogical 
observation  lies  in  the  fact  that  less  aggressive  tumors  poten¬ 
tially  recurring  after  a  long  period  of  time  are  cured  by  higher 
doses  and  do  not  contribute  to  the  observed  time  pattern  of  tu¬ 
mor  relapse.  As  a  result  tumors  in  higher  dose  groups  are  less 
likely  to  recur;  however,  if  they  do,  they  tend  to  recur  earlier. 


Table  1.  Estimates  of  the  Probability  of  Cure 
(semiparametricparametric)  and  95%  Semiparametric  Likelihood  Ratio 
Confidence  Intervals  (in  parentheses)  as  Estimated  Using  Multivariate 
Semiparametric  Regression  Analysis  Based  on  ( 3 . 1 1). 


Prognostic 

category 

Dose  group 

1 

2 

3 

4 

Favorable 

.78/.80 

.79/.74 

.88/.87 

1.00* 

(.55,  .95) 

(.68  ,.92) 

(.80,  .97) 

- 

Intermediate 

.37/.25 

.53/. 51 

.67/.58 

.79/74 

(.21,. 55) 

(.41  ,.64) 

(.58,78) 

(.68,  .87) 

Unfavorable 

.02/.00 

.35/.27 

.46/.33 

.60/.64 

(.00,. 11) 

(.25,  .45) 

(.38,. 56) 

(.45,74) 

*The  estimate  of  the  probability  of  cure  is  set  equal  to  1  because  no  failures  are  observed 
in  this  group  of  patients. 


4.  BAYESIAN  INFERENCE 
4.1  Parametric  Cure  Rate  Model 

Ibrahim,  Chen,  and  Sinha  (2001a)  presented  a  comprehen¬ 
sive  treatment  of  Bayesian  approaches  for  the  cure  rate  model 

G(0  =  expMF(f)}.  (4.1) 

We  review  some  of  their  work  here.  We  let  the  covariates  de¬ 
pend  on  0  through  the  relationship  0  =  exp(x'/l),  where  x  is  a 
p  x  1  vector  of  co variates  and  , . . . ,  /$p )  is  a  p  x  1  vector 

of  regression  coefficients.  Proceeding  from  the  biological  inter¬ 
pretation  discussed  in  Section  1 ,  we  can  now  construct  the  like¬ 
lihood  function  (see  Chen  et  al.  1999;  Ibrahim  et  al.  2001a)  in 
a  typical  setting.  Suppose  we  have  n  subjects  and  let  denote 
the  number  of  surviving  clonogenic  tumor  cells  for  the  i  th  sub¬ 
ject.  Further,  we  assume  that  the  N,-’s  are  i.i.d.  Poisson  random 
variables  with  mean  0,  i  =  1 , 2, . . . ,  n.  We  emphasize  here  that 
the  N{ ’s  are  not  observed  and  can  be  viewed  as  latent  variables 
in  the  model  formulation.  Further,  suppose  Z\\,Z{ 2, . . . , 
are  the  i.i.d.  progression  times  for  the  N(  cells  for  the  ith 
subject,  which  are  unobserved,  and  all  have  proper  cumulative 
distribution  function  F(-),  /  =  1,2,  ...,n.  We  denote  the  in¬ 
dexing  parameter  (possibly  vector  valued)  by  and  thus  write 

and  F(-|^)  =  1  -  F(-|^r).  For  example,  if  F(-|^)  cor¬ 
responds  to  a  Weibull  distribution,  then  f  =  (a,  X)',  where  a 
is  the  shape  parameter  and  X  is  the  scale  parameter.  Let  y,*  de¬ 
note  the  survival  time  for  subject  /,  which  may  be  right  cen¬ 
sored,  and  let  v*  denote  the  censoring  indicator,  which  equals 
1  if  y,*  is  a  failure  time  and  0  if  it  is  right  censored.  The  ob¬ 
served  data  are  D0bs  =  (n>  y*  v)>  where  y  =  (yi,  yi, . . . ,  yn)f 
and  v  =  (i>i ,  V2, . . . ,  vn)\  Also,  let  N  =  (N\ ,  N2, . . . ,  Nn)f.  The 
complete  data  are  given  by  D  =  (n,  y,  v,  N),  where  N  is  an  un¬ 
observed  vector  of  latent  variables.  Throughout  the  remainder 
of  this  section,  we  will  assume  a  Weibull  density  for  f(yi\ir), 
so  that  f(y\f)  =  ay“_1  exp{X  -  yaexp(X)}.  When  covariates 
are  included  we  have  a  different  cure  rate  parameter,  9{ ,  for  each 
subject,  /  =  1,2 Letx^  =  (*,- 1, . . . ,  Xip)  denote  the  p*  1 
vector  of  covariates  for  the  /th  subject.  We  relate  6  to  the  co¬ 
variates  by  0i  =  0(xj/?)  =  exp(x^),  so  that  the  cure  rate  for 
subject  /  is  exp(-ft)  =  exp(-  exp (x|0)),  /  =  1, 2, . . . ,  n.  This 
relationship  between  0,*  and  is  equivalent  to  a  canonical  link 
for  $i  in  the  setting  of  Poisson  regression  models.  With  this  re- 
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lation  we  can  write  the  complete  data  likelihood  of  (P,  if)  as 


x  exp 


^2  [Nix'jP  -  log  (Nil)  -  exp(x-jS)] 

i=l 


,  (4.2) 


where  D  —  (n,y,  X,  v,N),  X  is  the  n  x  p  matrix  of  covari¬ 
ates,  f(yt \f)  is  the  Weibull  density  given  previously,  and 
7(yi  |  VO  =  exp(-y?  exp(X)). 

Chen  et  al.  (1999)  discussed  classes  of  noninformative priors 
as  well  as  the  class  of  power  priors  for  (P,  f)  and  exam¬ 
ined  some  of  their  properties.  Consider  the  joint  noninfor¬ 
mative  prior  n(P,f)  oc  7r(f ),  where  f  =  (a,  X)'  are  the 
Weibull  parameters  in  f(y\if).  This  noninformative  prior  im¬ 
plies  that  p  and  if  are  independent  a  priori  and  n(fi)  oc  1 
is  a  uniform  improper  prior.  We  will  assume  throughout  this 
section  that  n (ir)  =  7r(a|<$o>  to)7r(X),  where  7r(a|5o,to)  a 
ct8°~]  exp(— roa)  and  <5o  and  To  are  two  specified  hyperpara¬ 
meters.  Although  several  choices  can  be  made,  we  will  use  a 
normal  density  for  n(X).  With  these  specifications  the  poste¬ 
rior  distribution  of  (/?,  if)  based  on  the  observed  data  D0 bs  = 
(n,y,  X,  v)  is  given  by 

n(P,  f  |A,bs) «  (X! L (P’  *|0)V(a|So,  to)^(X),  (4.3) 

'  N  ' 


where  the  sum  in  (4.3)  extends  over  all  possible  values  of  the 
vector  N.  Chen  et  al.  (1999)  gave  conditions  concerning  the 
propriety  of  the  posterior  distribution  in  (4.3)  using  the  non¬ 
informative  prior  n(p,if)  oc  7t(if).  This  enables  us  to  carry 
out  Bayesian  inference  with  improper  priors  for  the  regression 
coefficients  and  facilitates  comparisons  with  maximum  likeli¬ 
hood.  However,  under  the  improper  priors  n(py  if)  oc  n  (if),  the 
mixture  model  in  (1.2)  always  leads  to  an  improper  posterior 
distribution  for  P  as  shown  in  Chen  et  al.  (1999). 

Ibrahim  and  Chen  (2000)  described  a  general  class  of  infor¬ 
mative  priors  called  the  power  priors  for  conducting  Bayesian 
inference  in  the  presence  of  historical  data.  We  now  examine 
the  power  priors  for  (P,  if).  Let  no  denote  the  sample  size  for 
the  historical  data,  let  yo  be  an  no  x  1  of  right-censored  fail¬ 
ure  times  for  the  historical  data  with  censoring  indicators  vo, 
let  No  be  the  unobserved  vector  of  latent  counts  of  clonogenic 
cells,  and  let  Xo  be  an  no  *  P  matrix  of  covariates  correspond¬ 
ing  to  yo.  Let  Do  =  (no,  yo,Xo,  vo,No)  denote  the  complete 
historical  data.  Further,  let  jto(P,  if)  denote  the  initial  prior  dis¬ 
tribution  for  (P ,  if) .  A  beta  prior  is  chosen  for  ao,  leading  to  the 
joint  power  prior  distribution 


7r(P,if,ao\Do}0bs) 


where  L(P,if\Do)  is  the  complete  data  likelihood  given 
in  (4.2)  with  D  being  replaced  by  the  historical  data  Do  and 
Do,obs  =  (noi  yo,  Xo,  vo).  We  take  a  noninformative  prior  for 
7to(P,ir),  such  as  no(P,  if)  oc  no(if),  which  implies  no(P) 
oc  1 .  For  if  =  (a,  X)f  we  take  a  gamma  prior  for  a  with  small 
shape  and  scale  parameters  and  an  independent  normal  prior 


for  X  with  mean  0  and  variance  co-  Also,  ( yo ,  ^o)  are  speci¬ 
fied  prior  parameters.  The  prior  in  (4.4)  does  not  have  a  closed 
form  but  has  several  attractive  properties.  First,  we  note  that  if 
no (P,  if)  is  proper,  then  (4.4)  is  guaranteed  to  be  proper.  Fur¬ 
ther,  (4.4)  can  be  proper  even  if  no(P,  if)  is  improper.  Chen 
etal.  (1999)  characterized  the  propriety  of  (4.4)  when  no(P,  if) 
is  improper.  In  addition,  they  showed  that  the  power  prior  for  P 
based  on  the  model  (1 .2)  will  always  lead  to  an  improper  prior 
as  well  as  an  improper  posterior  distribution. 


4.2  Semiparametric  Cure  Rate  Model 

In  this  section  we  consider  a  semiparametric  version  of 
the  parametric  cure  rate  model  discussed  in  the  previous  sec¬ 
tion.  Following  Ibrahim  et  al.  (2001a),  Chen,  Harrington,  and 
Ibrahim  (2001),  and  Chen  and  Ibrahim  (2001),  we  construct 
a  finite  partition  of  the  time  axis,  0  <  s\  <  •  *  •  <  $/,  with 
sj  >  yi  for  all  i  =  1, 2, . . . ,  n.  Thus,  we  have  the  J  intervals 
(0, $i],  (s\,S2], ...» (jy-i,  Jy],  and  we  assume  that  the  hazard 
for  F(y)  is  equal  to  Xj  for  the  j  th  interval,  j  =  1,2 , 7, 
leading  to 

Jr*0’|A.)  =  1  -exp|— Ay(y-sy-i)  — -jg_i)J, 

(4.5) 

where  X  =  (X\, .  ..,Xj)f.  We  note  that,  when  7  =  1,  F*(y\X\) 
reduces  to  the  parametric  exponential  model.  With  this  assump¬ 
tion  the  complete  data  likelihood  can  be  written  as 


L(P, X\D) 


=nriexP 


-  Vi)Vij 


1=1  .M 


~sj- i) 


j- 1 

+  Z\(%-J*-i) 

g-\ 


X 


n  n<w<*  exp 


;=i  j= i 


xj(yi  -sj-i) 


7-1 

+  Xf,(s%  —  5g_i) 
g=l 


x  exp 


n 

[fyx'iP  -  log (77/!)  -  exp(x^)] 

i=i 


(4.6) 


where  Vij  =  1  if  the  i  th  subject  failed  or  was  censored  in  the  j  th 
interval,  and  0  otherwise.  The  model  in  (4.6)  is  a  semiparamet¬ 
ric  version  of  (4.2)  in  which  the  degree  of  the  nonparame tricity 
is  controlled  by  7.  Because  the  estimation  of  the  cure  rate  pa¬ 
rameter  0  could  be  highly  affected  by  the  nonparametric  nature 
of  F*(y\X),  it  may  be  desirable  to  choose  small  to  moderate 
values  of  7  for  cure  rate  modeling.  In  practice,  we  recommend 
doing  analyses  for  several  values  of  7  to  see  the  sensitivity  of 
the  posterior  estimates  of  the  regression  coefficients.  The  semi¬ 
parametric  cure  rate  model  (4.6)  is  quite  flexible,  as  it  allows 
us  to  model  general  shapes  of  the  hazard  function,  as  well  as 
choose  the  degree  of  parametricity  in  F*(y  |X)  through  suitable 
choices  of  7.  Again,  because  N  is  not  observed,  the  observed 
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data  likelihood,  L(j3,  X|Dobs),  is  obtained  by  summing  out  N 
from  (4.6)  as  in  the  previous  section.  Also,  power  priors  for  this 
model  can  be  constructed  in  a  similar  fashion  as  in  the  previ¬ 
ous  section.  We  refer  the  reader  to  Ibrahim  et  al.  (2001a),  Chen 
et  al.  (2001),  and  Chen  and  Ibrahim  (2001)  for  more  details. 

4.3  An  Alternative  Semiparametric  Cure  Rate  Model 


A  crucial  issue  with  cure  rate  modeling,  and  semiparamet¬ 
ric  survival  models  in  general,  is  the  behavior  of  the  model 
in  the  right  tail  of  the  survival  distribution.  In  these  models 
there  are  typically  few  subjects  at  risk  in  the  tail  of  the  sur¬ 
vival  curve  after  sufficient  follow-up;  therefore,  estimation  of 
the  cure  rate  can  be  quite  sensitive  to  the  choice  of  the  semi¬ 
parametric  model.  Thus,  there  is  a  need  to  carefully  model  the 
right  tail  of  the  survival  curve  and  allow  the  model  to  be  more 
parametric  in  the  tail,  while  also  allowing  the  model  to  be  non- 
parametric  in  other  parts  of  the  curve.  Ibrahim,  Chen,  and  Sinha 
(2001b)  constructed  such  a  model  by  defining  a  smoothing 
parameters,  0  <  k  <  1,  that  controls  the  degree  of  parametric- 
ity  in  the  right  tail  of  the  survival  curve  and  does  not  depend 
on  the  data.  Specifically,  the  prior  for  Xj  depends  on  s,  such 
that  the  model  converges  to  a  parametric  model  in  the  right  tail 
of  the  survival  curve  as  j  oo.  By  an  appropriate  choice  of  s, 
one  can  choose  a  fully  nonparametric  model  or  a  fully  para¬ 
metric  model  for  the  right  tail  of  the  survival  distribution.  Also, 
k  will  allow  some  control  over  the  degree  of  parametricity  in  the 
beginning  and  middle  part  of  the  survival  distribution.  A  more 
parametric  shape  of  the  model  in  the  right  tail  facilitates  more 
stable  and  precise  estimates  of  the  parameters.  This  approach 
is  fundamentally  very  different  from  previous  approaches  for 
semiparametric  Bayesian  survival  analysis,  which  primarily  fo¬ 
cus  on  specifying  a  prior  process  with  a  mean  function  and  pos¬ 
sibly  a  smoothing  parameter,  in  which  the  posterior  properties 
of  both  of  them  depend  on  the  data. 

Let  F0(r|^o)  denote  the  parametric  survival  model  chosen 
for  the  right  tail  of  the  survival  curve  and  let  Ho(t)  denote  the 
corresponding  cumulative  baseline  hazard  function.  Now  take 
the  X/s  to  be  independent  a  priori,  each  having  a  gamma  prior 
distribution  with  mean 


IXj  =  E(kj\f0): 


Ho(sj)  -  Hp (.9 j  - 1 ) 
Sj  ~  Sj~  1 


and  variance 


<rj  =  VaT(\.j\f0,K)  =  HjKJ,  (4.8) 

where  0  <  k  <  1  is  the  smoothing  parameter.  As  k  ->  0, 
o'?  ->  0,  so  that  small  values  of  k  imply  a  more  parametric 
model  in  the  right  tail.  In  addition,  as  j  — >  oo,  orj  — ►  0,  im¬ 
plying  that  the  degree  of  parametricity  is  increased  at  a  rate 
governed  by  k  as  the  number  of  intervals  increases.  This  prop¬ 
erty  also  implies  that,  as  j  -*  oo,  the  survival  distribution  in  the 
right  tail  becomes  more  parametric  regardless  of  any  fixed  value 
of  k.  The  properties  of  this  model  are  attractive.  For  exam¬ 
ple,  if  FoH^o)  is  an  exponential  distribution,  then  Fo(y|i^o)  = 
1  -  exp(-^oy),  so  that  [ij  =  and  <jj  =  t^o kK  If  FoH^o) 
is  a  Weibull  distribution,  then  Fo(y|^0)  =  1  -  exp(-yo/*°), 
=  (<*o>  Yo)\  so  that 

(*?-<-<>  2  >)  ,• 

M/  =  yo— - - -  and  of  =  y0— - - 

Sj-Sj-l  Sj-Sj-l 


Several  properties  of  this  model  are  now  given,  which  are 
proved  in  Ibrahim  et  al.  (2001  a, b). 

Property  4.1.  Assume  that  (sj  +  Sj-\)/2  — ►  t  as  Sj  — 
Sj- 1  — >  0.  Then,  for  any  j,  according  to  this  prior  process, 
ho(t)  as  Sj  -  Sj- 1  ->  0,  where  ho(t)  = 

For  example,  if  Fo(y\iro)  =  1  -  exp(-V'o)'),  then 
EikjWo)  =  V'O  regardless  of  the  choice  of  s\,S2,...,sj.  If 
^ofyllfro)  =  1  “  exp(-yoy“°),  then  E(Xj\f0)  -r  yoao?“°-1 
as  sj  —  Sj- 1  ->  0.  This  assures  that,  as  j  becomes  large  and 
Sj  -  Sj- 1  -*■  0,  this  prior  process  approximates  any  prior 
process  with  prior  mean  ho(t)  defined  on  the  progression  time 
hazard  h*(t  |X)  corresponding  to  (4.5). 

Property  4.2.  Let  /^(ylX)  =  exp(-0F*(y|X)).  Then 
J*p(y\X)  Fp(y|^o)  as  k  0,  where  Fp(y|f0)  = 

exp(-0Fo(y|fo)). 

Property  4.3.  Let  /*(y|X)  =  ^F*(y|X)  and  let  h*  (y|X)  = 
9f*{y\X)  denote  the  corresponding  hazard  function.  Then 
h*p(y |X)  6f0(y\1r0)  as  k  0,  where  My\f0)  = 

^oOW- 

We  now  give  jointprior  specifications  for  the  semiparametric 
model  in  (4.7)  and  (4.8).  We  specify  a  hierarchical  model  and 
first  consider  a  joint  (improper)  noninformative  prior  distribu¬ 
tion  for  Q3,  X,  f  0),  given  by 

=n(P)n(X\jr0)7r(ir0) 

‘  J 

exn(P)  l>Wo)  (4.9) 

-j= i 

We  take  each  n(Xj\ir0)  to  be  independent  gamma  densities 
with  mean  \ij  and  variance  crj.  If  FoOl^o)  is  an  exponential 
distribution,  then  t^o  is  a  scalar,  and  we  specify  a  gamma  prior 
for  it;  that  is,  7t(t^o)  ^q0-1  exp(— tq^o),  where  fo  and  ro  are 

specified  hyperparameters.  If  Fo(*  |^q)  is  a  Weibull  distribution, 
then  =  (ao,  Ko)/-  In  this  case  we  take  a  prior  of  the  form 

7t(ir0)  =  7T(a0iyo) 

excel*0  1  exp(-r0oao)yoy°  '  exp(-ryoyo)>  (4.10) 

where  £*0,  r<*0,  fyo,  and  ryo  are  specified  hyperparameters.  For 
P  we  consider  a  uniform  improper  prior.  Ibrahim  et  al.  (2001a, 
2001b)  established  conditions  for  the  propriety  of  the  joint  pos¬ 
terior  distribution  of  03,  X,  when  using  an  exponential  dis¬ 
tribution  or  a  Weibull  distribution  for  jPo(*  ll^o) •  addition, 
they  developed  the  power  prior  for  this  model. 

4.4  Multivariate  Cure  Rate  Model 

It  is  often  of  interest  to  jointly  model  several  types  of  fail¬ 
ure  time  random  variables  in  survival  analysis,  such  as  time  to 
cancer  relapse  at  two  different  organs,  times  to  cancer  relapse 
and  death,  times  to  first  and  second  infection,  and  so  forth.  In 
addition,  these  random  variables  typically  have  joint  and  mar¬ 
ginal  survival  curves  that  “plateau”  beyond  a  certain  period  of 
follow-up;  therefore,  it  is  of  great  importance  in  these  situations 
to  develop  a  joint  cure  rate  model  for  inference.  There  does 
not  appear  to  be  a  natural  multivariate  extension  of  the  mix¬ 
ture  model  in  (1 .2).  Even  if  such  an  extension  were  available,  it 


1074 


Journal  of  the  American  Statistical  Association,  December  2003 


appears  that  a  multivariate  mixture  model  would  be  extremely 
cumbersome  to  work  with  from  a  theoretical  and  computational 
perspective.  As  an  alternative  to  a  direct  multivariate  extension 
of  (1.2),  Chen,  Ibrahim,  and  Sinha  (2001)  proposed  a  multi¬ 
variate  generalization  of  (4.1),  called  the  multivariate  cure  rate 
model  This  model  proves  to  be  quite  useful  for  modeling  mul¬ 
tivariate  data  in  which  the  joint  failure  time  random  variables 
have  a  surviving  fraction  and  each  marginal  failure  time  ran¬ 
dom  variable  also  has  a  surviving  fraction.  To  induce  the  cor¬ 
relation  structure  between  the  failure  times,  Chen  et  al.  (2001) 
introduced  a  frailty  term  (Clayton  1978;Hougaard  1986;  Oakes 
1989),  which  is  assumed  to  have  a  positive  stable  distribution. 
A  positive  stable  frailty  results  in  a  conditional  proportional 
hazards  structure  (i.e.,  given  the  unobserved  frailty).  Thus,  the 
marginal  and  conditional  hazards  of  each  component  have  a 
proportional  hazards  structure  and  remain  in  the  same  class  of 
univariate  cure  rate  models. 

For  clarity  and  ease  of  exposition,  we  will  focus  our  discus¬ 
sion  on  the  bivariate  cure  rate  model,  as  extensions  to  the  gen¬ 
eral  multivariate  case  are  quite  straightforward.  The  bivariate 
cure  rate  model  of  Chen  et  al.  (2001)  can  be  derived  as  follows. 
Let  Y  =  (Fj ,  Yz)'  be  a  bivariate  failure  time,  such  as  Y\  =  time 
to  cancer  relapse  and  Y2  =  time  to  death,  or  Y\  =  time  to  first 
infection  and  F2  =  time  to  second  infection,  and  so  forth.  We 
assume  that  (Fi,  F2)  are  not  ordered  and  have  support  on  the 
upper  orthant  of  the  plane.  For  an  arbitrary  patient  in  the  pop¬ 
ulation,  let  N  =  (A/j ,  A^y  denote  latent  (unobserved)  variables 
for  (Fi ,  Y2 ),  respectively.  We  assume  throughout  that  Nk  has  a 
Poisson  distribution  with  mean  0*u;,  k  =  1,2,  and  ( N\ ,  N2 )  are 
independent,  given  w.  The  quantity  w  is  a  frailty  component  in 
the  model  that  induces  a  correlation  between  the  latent  variables 
(N\ ,  N2).  Here  we  take  w  to  have  a  positive  stable  distribution 
indexed  by  the  parameter  a ,  denoted  by  w  ~  Sa  ( 1 , 1 , 0) ,  where 
0  <  a  <  1  (see  Chen  et  al.  2001  for  more  details).  Although 
several  choices  can  be  made  for  the  distribution  of  it>,  the  posi¬ 
tive  stable  distribution  is  quite  attractive,  common,  and  flexible 
in  the  multivariate  survival  setting.  In  addition,  it  will  yield  sev¬ 
eral  desirable  properties. 

LetZ  i  =  {Zyu  Z2i)f  denote  the  bivariate  progression  time  for 
the  zth  clonogenic  cell.  The  random  vectors  Z, ,  i  =  1,2,..., 
are  assumed  to  be  independent  and  identically  distributed.  The 
cumulative  distribution  function  of  Zki  is  denoted  by  F*(r)  = 
1  -  Fidt),  k  =  1,2,  and  Fk  is  independent  of  (N\}  N2).  The 
observed  survival  time  can  be  defined  by  the  random  variable 
Yk  —  min{Zjti ,  0  <  i  <  AT*},  where  P(Zko  =  00)  =  1  and  AT* 
is  independent  of  the  sequence  Z*  1 ,  Zfa , . . . ,  for  k  =  1 , 2.  The 
survival  function  for  Y  =  (Y\ ,  given  w ,  and  hence  the  sur¬ 
vival  function  for  the  population,  given  w ,  is  given  by 

Fpop(yuy2\w) 

2 

=  Y\  [P(Nk  =  0)  +  P(Zk  1  >  y*, . ZkN  >  yk ,  Nk  >  1)] 

k=  1 

2 

=n 

*=1 
2 

=  ]"[exp{-u)0/t  +  9i<wFk(yk)} 

k=] 

=  exp{-w[0iFi(yi)  +  02^2()’2)]}.  (4.11) 


exp(-w0*)  +  2J  FkiykY 


(weky 


r! 


exp(—  wdk) 


where  P(Nk  =  0)  =  P(F*  =  oo)  =  exp(— 0*),  k  —  1,2.  The 
frailty  variable  w  serves  a  dual  purpose  in  the  model — it  in¬ 
duces  the  correlation  between  Y\  and  Y2  and  at  the  same  time 
relaxes  the  Poisson  assumption  of  N\  and  N2  by  adding  the 
same  extra  Poisson  variation  through  their  respective  means 
01  w  and  O2W. 

The  Laplace  transform  of  w  is  given  by  E(exp(-.yu;))  = 
exp(— sa).  Using  the  Laplace  transform  of  w,  a  straightforward 
derivation  yields  the  unconditional  survival  function 

^pop(yi.W)  =exp{-[0iFi(;yi)  +  02F2O’2)]“}-  (4.12) 

It  can  be  shown  that  (4.12)  has  a  proportional  hazards  struc¬ 
ture  if  the  covariates  enter  the  model  through  (0i,  02).  The  joint 
cure  fraction  implied  by  (4.12)  is  F pop(oo,  00)  =  exp(-[0j  + 
#2]*).  From  (4.12)  the  marginal  survival  functions  are  Fk(y)  = 
exp(— 0£(Fk(y))a),  k—  1,2.  Equation  (4. 12)  indicates  that  the 
marginal  survival  functions  have  a  cure  rate  structure  with 
probability  of  cure  exp(-0")  for  Y*,  k  =  1,2.  It  is  impor¬ 
tant  to  note  in  (4.12)  that  each  marginal  survival  function 
has  a  proportional  hazards  structure  as  long  as  the  covari¬ 
ates,  x,  only  enter  through  9k .  The  marginal  hazard  function 
is  given  by  aO®  fk(y)(Fk(y))a~li  with  attenuated  covariate 
effect  (0*00)°',  and  fk(y)  is  the  survival  density  correspond¬ 
ing  to  Fk(y).  This  property  is  similar  to  the  earlier  observa¬ 
tions  made  by  Oakes  (1989)  for  the  ordinary  bivariate  stable 
frailty  survival  model.  The  parameter  a,  0  <  a  <  l,isa  scalar 
parameter  that  is  a  measure  of  association  between  (Fi,  F2). 
Small  values  of  a  indicate  high  association  between  (Fj,  F2). 
As  a  1  this  implies  less  association  between  (F 1,  F2),  which 
can  be  seen  from  (4.12).  Following  Clayton  (1978)  and  Oakes 
(1989),  we  can  compute  a  local  measure  of  dependence,  de¬ 
noted  0*(yi ,  y2),  as  a  function  of  a.  For  the  multivariate  cure 
rate  model  in  (4.12),  0*(fi,  ti)  is  well  defined  and  is  given  by 

0* (yi ,  yi)  =  «-'■ 1  ( 1  -  a)  (<?i  F\  (y i )  +  e2F2 {y2)) +  1 .  (4. 1 3) 

We  see  that  0*(yi,  y2)  in  (4.13)  decreases  in  (yi,  y2).  That  is, 
the  association  between  (Fi,  F2)  is  greater  when  (Fi,  F2)  are 
small  and  the  association  decreases  over  time.  Such  a  prop¬ 
erty  is  desirable,  for  example,  when  Y\  denotes  time  to  relapse 
and  F2  denotes  time  to  death.  Finally,  we  mention  that  a  global 
measure  of  dependence  such  as  Kendall’s  r  or  the  Pearson  cor¬ 
relation  coefficient  is  not  well  defined  for  the  multivariate  cure 
rate  model  (4. 12)  because  no  moments  for  cure  rate  models  ex¬ 
ist  due  to  the  improper  survival  function. 

The  likelihood  function  for  this  model  based  on  n  subjects 
can  be  written  as 


L(fi,f\D) 


=  (n  fi  ^  (yki\fk)Nti  ~Vki  (Nkifkiyki  I  fk)Yu  ) 


x  exp 


£(A hi  lo g(u>/0*)  -  \og{Nki !)  -  Wi0k)  (4.14) 
/=! 


where  jr  =  (^1*^2)^  ^  =  (01.02),  A hi  is  the  number  of 
clonogens  for  the  zth  subject,  and  fk(yki\$k)  ^en_ 

sity  corresponding  to  /'Hy*;!^*)*  J  =  !»•••*«»  k  =  1,2. 
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Chen  et  al.  (2001)  used  a  Weibull  density  for  fk(yki\fk)>  so 
that  fk(y\fk)  =  exp{Ajt  -  exp (Xk)},  where  fk  = 

(&>  k  =  1  >  2.  To  construct  the  likelihood  function  of  the  ob¬ 
served  data,  we  integrate  (4.14)  with  respect  to  (N,  w)  assuming 
an  5a(l,  1, 0)  density  for  each  uj/,  denoted  by  fs(wi \a),  where 
N  =  (N\\, ...»  N\n,  N2\,...>N2»)  and  w  =  (w\,...,wn).  As 
before  we  incorporate  covariates  for  the  cure  rate  model  (4.12) 
through  the  cure  rate  parameter  0.  Let  x^  =  (*a,  . . . , *ip) 

denote  the  p  x  1  vector  of  covariates  for  the  ith  subject  and 
let  pk  =  (ftti,  Pkh  •  •  •»  PkpY  denote  the  corresponding  vec¬ 
tor  of  regression  coefficients  for  failure  time  random  vari¬ 
able  Yic ,  k  =  1,2.  We  relate  0  to  the  covariates  by  Oki  = 
=  exp (xj/?*),  so  that  the  cure  rate  for  subject  i  is 
exp(-^r)  =  exp(-exp(x-/?*))  for  i  =  1, 2, . . .  ,n  and  k=  1,  2. 
Letting  p  =  (fi\ ,  P2)\  Chen  et  al.  (2001)  showed  that  the  ob¬ 
served  data  likelihood  of  (p,  f,  a)  can  be  written  as 


L(P,  f ,  a  I  Dobs) 


=  (ad'+di  f]  n  exp(x-/9t)'j  f]  fl  fkiykiW kYkt 
V  k=\  ieVk  /  Lk=\i=\ 


n 

X  ]~[{[exp(x'/9|)Fi(yi,|f ,) 


i=l 


+  exp(x'/S2)^2(y2ilTJr2)] 


(or— 1  XviH-t^i) 


*  fl  {-l-^[exp(Xjj8,)Fi(yif|^i) 


+  exp(x'/?2)F20>2<#2)]  “  +  1 


V\i  v2i 


x  flexp(-(exp(x^1)F1(yi,#i) 
f=i 

+  exp(x'/J2)F2(y2,#2))“j.  (4.15) 

where  D*  consists  of  those  patients  who  failed  according  to 
Yk,k=  1,2,  D0bS  =  («,yi,y2>X,  v\9  v2),  X  is  the  n  x  p  ma¬ 
trix  of  covariates,  fk(yki\fk)  *s  Weibull,  and  Fk(yki\fk )  = 
exp(-yf)  exp(X*)). 

Chen  et  al.  (2001)  considered  a  joint  improper  prior  for 
( P ,  f,  a)  ==  (Pi,  P2,f\,  fl' °0  of  the  form 


n(p,ir,a)  =  Jr(/»1#/»2f  fuf2><*) 


oc  7i(f  i)7r(f2)I(0  <  oi  <  1) 

2 

=  ]~I7r(&,x*)/(0<a  < 

where  7(0  <  a  <  1)  =  1  if  0  <  a  <  1,  and  0  otherwise.  This 
prior  implies  that  P ,  ^ ,  and  a  are  independenta  priori,  (0  j ,  02) 
are  independenta  priori  with  an  improper  uniform  prior,  a  has 
a  proper  uniform  prior  over  the  interval  (0, 1),  and  (^j,  f2) 
are  independent  and  identically  distributed  as  n (fk)  a  priori. 
Chen  et  al.  (2001)assumed  that7r(£*,  A.*)  =  7r(&|vo,  ro)7r(*jO, 
where 

To)  OC^0_I  exp{-r0^}  and  n{kk)  oc  exp(-c0X^), 


and  <5o,  to,  and  co  are  specified  hyperparameters.  Chen  et  al. 
(2001)  and  Ibrahim  et  al.  (2001a)  gave  a  detailed  discussion  of 
the  computational  implementation  of  this  model. 

Detailed  real  data  examples  for  the  models  discussed  in  Sec¬ 
tions  4. 1—4.3  can  be  found  in  Ibrahim  et  al  (2001a).  The  Gibbs 
sampler  was  used  to  obtained  posterior  estimates  for  all  mod¬ 
els  discussed  in  this  section.  Details  of  the  Gibbs  sampling  al¬ 
gorithms  and  covergence  diagnostics  can  be  found  in  Ibrahim 
et  al.  (2001). 


5.  FUTURE  RESEARCH 

Future  areas  of  research  include  expanding  the  Bayesian 
model  to  include  co  variates  in  0  as  well  as  F(t).  Although  not 
investigated  here,  this  appears  to  be  a  natural  extension  of  the 
Bayesian  model  presented  in  Section  4.  A  more  general  regres¬ 
sion  model  is  currently  being  investigated.  The  results  reported 
in  Section  3  from  the  frequentist  perspective  also  encourage  this 
research  effort. 

The  class  of  models  given  by  expression  (1.8)  in  Section  1 
opens  a  new  avenue  in  parametric  (and  probably  semiparamet- 
ric;  see  Section  3)  survival  regression.  For  example,  by  incorpo¬ 
rating  a  predictor  into  the  parameter  i  we  obtain  a  proportional 
hazards  (PH)  model.  The  quantity  i  is  unobservable,  but  it  is 
reasonable  to  assume  that  i  is  proportional  to  the  observable 
tumor  volume.  If  one  predictor  is  incorporated  into  the  parame¬ 
ter  i  whereas  another  covariate  (e.g.,  fractional  radiation  dose) 
is  incorporated  into  some  other  parameter(s)  of  the  model,  the 
resulting  regression  counterpart  of  (1.8)  will  no  longer  be  the 
PH  model.  From  a  statistical  viewpoint  the  main  advantage  of 
this  generalization  of  the  clonal  model  of  tumor  recurrence  is 
that  it  can  suggest  the  structure  of  new  regression  models  to  be 
further  explored  by  statistical  methods.  Such  models  may  pro¬ 
vide  a  useful  means  of  searching  for  better  regimens  for  frac¬ 
tionated  radiotherapy. 

Another  promising  idea  is  to  explore  possible  links  between 
stochastic  models  proposed  for  the  natural  history  of  cancer  and 
for  posttreatment  cancer  survival;  this  approach  can  enrich  sta¬ 
tistical  inference  by  supplementing  the  analysis  of  patients’  sur¬ 
vival  with  additional  epidemiological  data  on  cancer  detection. 
We  believe  that  this  idea  may  dramatically  change  the  whole 
concept  of  modern  parametric  survival  analysis,  by  invoking 
mechanistically  motivated  models  for  the  joint  distribution  of 
covariates  at  the  time  of  diagnosis. 

Methods  for  model  diagnostics  especially  designed  for  the 
semiparametric  models  in  Section  3  remain  a  very  important 
issue  for  future  methodological  research.  The  same  is  equally 
true  for  the  regression  counterparts  of  the  two-component  mix¬ 
ture  model.  Justification  of  asymptotic  properties  of  the  esti¬ 
mate  represent  a  serious  challenge. 

6.  SOFTWARE  AND  DATA  ANALYSIS  FOR 
CURE  RATE  MODELS 

Prototype  software  implementing  the  estimation  procedures 
of  Section  3.2  was  programmed  in  Delphi  6.0  for  Windows 
95/98/2000/XP,  an  object-oriented  visual  development  en¬ 
vironment  based  on  the  Pascal  programming  language  by 
INPRISE™.  This  software  is  available  on  request  from  Alex 
Tsodikov.  Implementation  of  these  methods  in  high-level  lan¬ 
guages  such  as  R,  S,  or  SAS  appears  straightforward.  In  addi¬ 
tion,  Extensive  BUGS  software  is  available  for  the  cure  rate  and 
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other  related  models  given  in  the  book  by  Ibrahim  et  al .  (200 1  a). 
This  software  can  be  easily  downloaded  from  the  book’s  Web 
site  at  littp://www.merlot.uconn.edufaihchen/survbook.  BUGS 
code  for  several  models  and  datasets  is  available  from  this  Web 
site. 

In  the  semiparametric  context  a  cure  model  can  be  regarded 
as  a  convenient  reparameterization  of  its  counterpart  in  the 
noncure  form  [compare  the  traditional  form  of  the  PH  model 

_ Q 

G  =  F  with  its  cure  form  (3.1)].  A  display  of  the  Kaplan- 
Meier  estimator  for  the  data  may  indicate  a  noncure  situation  if 
the  curves  go  to  0  at  the  end  of  follow-up.  Fitting  a  cure  form 
model  to  such  data  may  result  in  numerical  instability  as  the  pa¬ 
rameter  coding  the  cure  rate  would  be  approaching  the  border 
of  the  parametric  space. 

Departures  from  proportionality  and  an  indication  for  the 
extended  hazard  models  can  be  explored  using  an  empirical 
display  of  the  function  F  (2.4)  or  its  more  sophisticated  ver¬ 
sions  as  discussed  in  Section  2.  Similar  techniques  for  model 
exploration  is  available  for  two-component  mixture  models 
(Tsodikov  2002).  More  work  is  needed  to  create  a  toolbox  of 
methods  for  model  building,  choice  and  diagnostics. 

7.  BIBLIOGRAPHICAL  NOTES 

Over  recent  years  the  BCH  model  has  been  developed  in 
many  different  directions.  In  particular,  the  literature  on  para¬ 
metric  inference  based  on  this  model  is  quite  voluminous. 
The  usual  practice  is  to  use  the  two-parameter  gamma  or  the 
Weibull  distribution  for  the  function  F(t)  in  formula  (1.4). 
Maximum  likelihood  inference  without  covariates  has  been 
discussed  in  the  context  of  right-censored  data  under  con¬ 
tinuous  follow-up  (Asselain,  Fourquet,  Hoang,  Myasnikova, 
and  Yakovlev  1994;  Hoang,  Tsodikov,  Yakovlev,  and  Asselain 
1995;  Yakovlev  1996;  Yakovlev  and  Tsodikov  1996);  discrete 
surveillance  design  (Tsodikov,  Asselain,  Fourquet,  Hoang, 
and  Yakovlev  1995),  and  doubly  censored  data  (Kruglikov, 
Pilipenko,  Tsodikov,  and  Yakovlev  1997).  A  version  of  the 
Hjort  goodness-of-fit  test  for  the  model  (1.4)  with  F(t)  rep¬ 
resented  by  the  gamma  distribution  was  described  in  Gregori 
et  al.  (2002).  Tsodikov  (1998a)  studied  the  asymptotic  effi¬ 
ciency  of  cure  rate  estimation.  Some  limiting  distributions  as¬ 
sociated  with  model  (4.1)  and  their  bivariate  counterparts  were 
described  in  the  articles  by  Klebanov,  Rachev,  and  Yakovlev 
(1993a)  and  Rachev,  Wu,  and  Yakovlev  (1995);  the  authors 
also  provided  estimates  of  the  convergence  rates.  Parametric 
and  semiparametric  regression  models  allowing  for  dissimilar 
effects  of  covariates  on  the  probability  of  cure  and  the  tim¬ 
ing  of  the  event  of  interest  were  extensively  explored  in  sev¬ 
eral  publications  (Asselain,  Fourquet,  Hoang,  Tsodikov,  and 
Yakovlev  1996;  Tsodikov  et  al.  1998b;  Myasnikova,  Asse¬ 
lain,  and  Yakovlev  2000;  Tsodikov  2002).  An  improper  pro¬ 
portional  hazards  model  was  studied  by  Tsodikov  (1998a, b) 
and  Chen  and  Ibrahim  (2001).  Time-dependent  risk  factors 
for  the  cure  probability  were  introduced  in  Tsodikov  et  al. 
(1997,  1998b)  and  Tsodikov  and  Muller  (1998).  Two-sample 
score  tests  for  long-term  and  short-term  effects  on  survival 
were  proposed  by  Broet  et  al.  (2001).  Statistical  inference  from 
the  Bayesian  prospective  was  discussed  by  Klebanov,  Rachev, 
and  Yakovlev  (1993b),  Chen  et  al.  (1999,  2001),  Ibrahim  and 
Chen  (2000),  Ibrahim  et  al.  (200 1  a, b),  and  Chen,  Ibrahim,  and 


Journal  of  the  American  Statistical  Association,  December  2003 

Lipsitz,  (2002).  Ibrahim  et  al.  (2001a)  devoted  an  entire  chap¬ 
ter  (chap.  5)  to  the  cure  rate  model  and  its  applications  in  their 
book.  Moreover,  recent  books  discussing  Gibbs  sampling  for 
this  model  and  related  models  are  also  given  in  Ibrahim  et  al. 
(2001a)  and  Chen,  Shao,  and  Ibrahim  (2000).  The  model  has 
successfully  been  applied  to  various  datasets  on  cancer  survival 
(Yakovlev  et  al.  1993;  Asselain  et  al.  1994,  1996;  Yakovlev 
1996;  Yakovlev  and  Tsodikov  1996;  Tsodikov  et  al.  1995, 
1997;  Tsodikov  1998a,  Tsodikov  et  al.  1998b;  Chen  et  al. 
1999;  Yakovlev  et  al.  1999;  Myasnikova  et  al.  2000;  Trelford, 
Tsodikov,  and  Yakovlev  2001 ;  Tsodikov 200 1 , 2002;  Tsodikov, 
Dicello,  Zaider,  Zorin,  and  Yakovlev  2001;  Zaider  et  al.  2001; 
Zorin,  Tsodikov,  Zharinov,  and  Yakovlev  2001). 
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Abstract:  The  bounded  cumulative  hazard  model  is  a  generalization  of  the 
proportional  hazard  model  with  several  distinct  advantages:  (1)  It  has  superior 
flexibility,  allowing  a  fit  to  data  where  the  the  proportional  hazards  assumption 
does  not  apply;  (2)  It  has  a  form  that  is  suitable  for  semiparametric  inference; 
and  (3)  It  may  offer  an  interpretation  in  terms  of  biologically  meaningful  pa¬ 
rameters.  In  this  paper  the  bounded  cumulative  hazard  model  is  discussed. 
Several  versions  of  the  model  are  applied  to  breast  cancer  recurrence  data  from 
the  Curie  Institute. 
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1.1  Introduction 

In  many  clinical  and  epidemiological  settings,  investigators  encounter  cause- 
specific  survival  curves  that  tend  to  level  off  at  a  value  strictly  greater  than  zero 
as  time  increases.  This  plateau  may  be  taken  as  an  indication  of  the  presence  of 
a  proportion  of  patients  for  whom  the  disease  under  study  will  never  recur.  One 
can  consider  such  patients  to  be  effectively  cured.  The  probability  of  (biological) 
cure,  variously  referred  to  as  the  cure  rate  or  the  surviving  fraction,  is  defined  as 
an  asymptotic  value  of  the  survival  function  G(t)  as  t  tends  to  infinity.  Let  X  be 
the  survival  time  with  cumulative  distribution  function  (c.d.f.)  G(t)  =  1  -G(t). 

The  existence  of  a  non-zero  surviving  fraction,  p ,  is  determined  by  the  be- 
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havior  of  the  hazard  function,  A  (£),  by  virtue  of  the  equality 


p  =  lim  G(t)  =  exp 

t — ^00 


(1.1) 


Whenever  p  >  0  (the  integral  in  (1.1)  converges),  the  underlying  survival  time 
distribution  is  said  to  be  improper.  Clearly,  A(u)  — »0asu--»ooifp>0  and 
the  limit  of  X(u)  (as  u  oo)  exists. 

Boag  (1949)  and  later  Berkson  and  Gage  (1952)  proposed  a  two-component 
(binary)  mixture  model  for  the  analysis  of  survival  data  when  a  proportion  of 
patients  is  cured.  Since  then,  the  binary  mixture-based  approach  has  become 
the  dominant  one  in  the  literature  on  cure  models  (Miller,  1981;  Mailer  and 
Zhou,  1996).  The  main  idea  behind  this  approach  is  that  any  improper  survival 
function  can  be  represented  as 

G(t)  =  E  {[<50(f)]M}  =  p  +  (1  -  p)Go(f),  (1.2) 

where  M  is  a  binary  random  variable  taking  on  the  values  of  0  and  1  with 
probability  p  and  1  —  p,  respectively,  with 


p  =  Pr{X  =  oo}, 

and  Go(£)  is  defined  as  the  survival  function  for  the  time  to  failure  conditional 
upon  ultimate  failure,  i.e. 


G0(t)  =  Pr{X  >  t\X  <  oo}.  (1.3) 

An  alternative,  but  equally  general,  representation  of  an  improper  survival 
time  distribution  can  be  obtained  by  assuming  that  the  cumulative  hazard 
A (t)  =  Jq  A (t)dt  has  a  finite  positive  limit,  say  9 ,  as  t  tends  to  infinity.  In  this 
case,  one  can  write 

G{t)  =  e~6F{t\  9>  0,  t>  0,  (1.4) 

where  F(t)  =  A {t)/6  is  the  c.d.f.  of  some  non-negative  random  variable  such 
that  -F(O)  =  0.  In  what  follows,  we  will  call  the  model  given  by  (1.4)  the 
bounded  cumulative  hazard  (BCH)  model. 

Clearly,  estimating  the  proportion  of  cured  patients  may  have  important 
medical  implications.  In  addition,  clinical  covariates  may  exert  dissimilar  effects 
on  the  probability  of  cure  and  the  timing  of  tumor  relapse  or  other  events  of 
interest.  There  are  at  least  two  advantages  of  the  BCH  model:  (1)  they  enrich 
our  ability  to  interpret  survival  analysis  in  terms  of  characteristics  that  have 
a  clear  biomedical  meaning;  (2)  they  lead  to  more  general  regression  models, 
thereby  extending  our  ability  to  describe  actual  data.  It  took  a  long  time  to 
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realize  these  virtues  of  the  BCH  model.  The  first  move  in  this  direction  was 
due  to  Haybittle  (1959,  1965).  The  author  proceeded  from  the  observation  that 
in  some  clinical  data  on  cancer  survival,  an  actuarial  estimate  of  the  hazard 
function  tends  to  decrease  exponentially  with  time.  If  the  same  property  holds 
for  the  true  hazard,  expression  (1.4)  assumes  the  form: 

G{t)  =  e-e^-e~at\  a  >  0.  (1.5) 

A  comprehensive  treatment  of  this  model  was  given  by  Cantor  and  Shuster 
(1992).  Clearly,  the  Gompertz-like  model  given  by  formula  (1.5)  is  a  special 
case  of  formula  (1.4)  with  the  function  F(t)  specified  by  an  exponential  c.d.f. 
with  parameter  a. 

The  representation  (1.4)  for  the  BCH  model  was  first  introduced  in  the 
paper  of  Yakovlev  et  al.  (1993)  and  discussed  later  as  an  alternative  to  the  bi¬ 
nary  mixture  model  by  Yakovlev  (1994).  Interestingly  enough,  Yakovlev  et  al. 
(1993)  proceeded  from  purely  biological  considerations;  the  idea  of  imposing  a 
constraint  on  the  behavior  of  the  hazard  function  was  introduced  much  later 
in  Tsodikov  et  al.  (1998).  In  fact,  the  authors  proposed  a  simple  mechanisti¬ 
cally  motivated  model  of  tumor  recurrence  yielding  an  improper  survival  time 
distribution.  Under  this  model,  the  probability  of  tumor  cure  is  defined  as  the 
probability  of  no  clonogenic  tumor  cells  surviving  by  the  end  of  treatment.  A 
comprehensive  account  of  the  BCH  model  and  associated  statistical  methods  is 
given  by  Tsodikov,  Ibrahim,  and  Yakovlev  (2003). 

In  this  paper,  we  discuss  methods  of  semiparametric  inference  based  on  the 
BCH  model  that  have  been  recently  developed.  We  then  illustrate  the  method 
by  applying  several  (semiparametric  and  parametric)  versions  of  the  BCH  model 
to  breast  cancer  recurrence  data  from  the  Curie  Institute. 


1.2  BCH  Regression  Models 

1.2.1  Mixture  Models  and  Generalizations 

In  this  subsection  we  discuss  how  certain  BCM  models,  and  in  particular,  the 
double  proportional  hazards  model,  arise  naturally  within  a  more  general  class 
of  regression  models,  called  Nonlinear  Transformation  Models. 

Perhaps  the  simplest  bounded  cumulative  hazard  model  is  obtained  by  mak¬ 
ing  the  assumption  that  the  cumulative  hazard  is  bounded  in  a  proportional 
hazards  (PH)  model.  In  this  way  we  obtain  the  so-called  improper  PH  model 

G{t\z)  =  exp{-0(/3,  z)F(t)},  (1.6) 

where  j3  is  a  vector  of  regression  coefficients,  and  2  is  a  vector  of  covariates, 
and  -0(/3,  z)  is  a  fixed  (known)  function  relating  (3  to  z.  A  general  class  of 
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regression  models,  Nonlinear  Transformation  Models  (NTM),  was  proposed  in 
Tsodikov  (2002,2003),  in  the  context  of  semiparametric  models: 

G(t\z)=7(F(tMz),  (1.7) 

where  7 (x\(3,z)  is  some  parametrically  specified  cumulative  distribution  func¬ 
tion  in  x  with  support  on  [0, 1].  In  the  examples  we  assume  7  is  parametrized 
through  a  set  of  parameters/predictors  0,77, where  each  predictor  is  further 
parametrized  using  (generally  different)  sets  of  regression  coefficients  /?i,/32,  •••» 
so  that  6  —  exp{p[z},r]  =  exp{/3'2z}.  Linear  transformation  models  consid¬ 
ered  by  Cheng  et  al.  (1995)  represent  a  subclass  of  NTM.  In  order  to  make 
(1.7)  a  BCH  model,  the  following  assumptions  are  made  to  enforce  the  limit 
G(t\(3,  z)  — p  >  0: 

7(01/3,  *)>(),  limF(f)  =  0.  (1.8) 

The  restriction  F(last  failure)  =  0  proposed  by  Taylor  (1995)  in  the  context 
of  the  two-component  mixture  model  removes  the  over-parameterization  of  the 
description  of  the  baseline  cure  rate  through  F  and  h.  Following  the  restric¬ 
tion,  the  estimate  F  is  assumed  to  satisfy  F(last  failure)  =  0.  Moreover,  it  is 
necessary  for  separation  of  the  long-  and  short-term  covariate  effects  on  survival. 

Alternatively,  a  BCH  model  can  be  formulated  as  a  two-stage  model.  First, 
an  unobservable  random  variable  M  is  postulated  with  distribution  p(m\z)  that 
depends  on  covariates.  Second,  the  observed  survival  function  is  formed  as 

G(t\f3,z)  =  E[F(t)M\z],  (1.9) 

where  the  expectation  is  taken  with  respect  to  M  given  z,  and  the  distribution 
of  M  depends  on  (3  and  z  .  The  model  (1.9)  represents  a  generalized  PH  frailty 
model,  which  we  simply  call  the  PH  mixture  model  (Tsodikov, 2002;  2003), 
where  M  is  assumed  to  be  an  arbitrary  non-negative  random  variable.  In  this 
context,  M  can  be  interpreted  as  a  missing  covariate  in  a  PH  model.  Since 
all  models  considered  previously  in  this  paper  are  particular  cases  of  (1.9),  an 
estimation  procedure  designed  for  the  PH  mixture  model  represents  a  universal 
tool  for  statistical  inference  with  such  models. 

The  discrete  mixture  model  (i.e.  (1.9)  with  a  discrete  random  variable  M) 
can  be  linked  to  the  NTM  class.  Observe  that  the  probability  generating  func¬ 
tion  of  a  discrete  random  variable  M 

00 

<Pm(x)  =  PmX™ 

m= 0 

is  a  distribution  function  in  x  with  the  support  [0,1].  Indeed,  since  pm  is  non¬ 
negative,  <pm(x)  is  increasing  in  x .  Also,  v?m(1)  =  Em=oPm  =  1.  If  M  is 
discrete  and  depends  on  covariates,  (1.9)  can  be  rewritten  as  an  NTM  with 

7(x|z)  =  ( Pm{x\z ). 
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With  the  discrete  mixture  model,  the  probability  of  cure  is  given  by  7(0|z)  = 
<PM{O\0,z)  =  po{0,z). 

We  can  now  represent  both  mixture  and  BCH  models  discussed  earlier  in  this 
paper  as  members  of  the  Mixture  -  NTM  model  family.  Consider  an  extension 
of  the  improper  PH  model  allowing  for  dissimilar  covariate  effects  on  long-  and 
short-term  survival.  To  construct  an  extended  hazard  model  (the  term  was 
introduced  by  Etezadi-Amoli  and  Ciampi  (1987)),  we  employ  the  fact  that  F  is 
a  survival  function.  Incorporating  covariates  into  F,  we  can  add  a  short-term 
effect  to  the  improper  PH  model.  The  class  of  extended  hazard  models 

G{t\P,  z)  =  exp  {— 0(/3i,  z) [1  -7(F(f)|/?2,2)]}  ,  (1-10) 

where  7  is  an  NTM,  0  =  (Pi,  Pi),  Pi  =  (P(H,  fin,  —  1,2  was  introduced  in 
Tsodikov  (2002).  If  7  is  a  mixture  model  itself,  then  (1.10)  is  also  a  mixture 
model  with  7  —  exp{-0[l  -  7]}.  The  mixing  variable  M  that  generates  the 
family  (1.10)  has  a  well  defined  structure  of  the  compound  Poisson  variable 

(1.11) 

fc=l 

where  v  is  a  Poisson  random  variable,  and  are  i.i.d.  copies  of  a  random 
variable  £  with  Laplace  transform  7(e"A).  The  binary  distribution  for  v  gives 
rise  to  the  two-component  mixture  class  of  models 

G(t\P,z)  =p(Pi,z)  +  p{0i,  z)j[F(t)\f32,  z\,P  =  1  -p.  (1.12) 

Kuk  and  Chen  (1992)  proposed  a  regression  model  in  the  form  of  (1.12)  with  p 
being  a  logistic  regression,  and  7  a  proportional  hazards  regression.  This  model 
was  further  studied  by  Sy  and  Taylor  (2000)  and  Peng  and  Dear  (2000).  The 
Poisson  distribution  for  v  leads  to  the  bounded  hazard  family  of  mixture  models. 
For  example,  let  £  be  degenerate  (nonrandom):  Pr(£  =  rj(z))  =  1.  Then  M  = 
7](p 2,z)v,  where  v  is  Poisson  with  expectation  0{z).  With  the  parametrization 
0(/?i,  z)  =  exp (0i z)  and  r)((3 2,  z)  =  exp(/?2 z),  we  have  the  PHPH  model 

G(t\p,  z )  =  exp  [-  exp (p[z)  {l  -  Fex p^}]  ,  (1.13) 

The  model  (1.13)  was  proposed  by  Broet  et  al.  (2001)  in  the  context  of  two 
sample  score  tests  for  long-  and  short-term  covariate  effects.  An  intercept  term 
(log(cure  rate))  is  included  in  fix  but  not  fa-  With  foo  set  to  zero,  /?io  may  be 
interpreted  as  the  baseline  log  cure  rate.  With  the  exponential  parametrization 
of  predictors  (  1.13),  the  baseline  survival  function  takes  the  form  Gb(t\/3o,  z)  — 
exp  [—  exp(f3xoz)  {1  -  7(F|0)}],  where  0=(0,0, ...)  and  f30  =  ((/?io,0,0,  ...,),0). 
As  in  the  Cox  model  (Cox,  1972),  the  regression  coefficient  i  =  1,2,  j  = 
1,2,...  correspond  to  the  relative  effects  of  the  covariate  Z{j  on  the  long-  or 
short-term  predictor,  i  =  1,2,  respectively. 
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At  times  an  extension  of  (1.7)  may  be  convenient.  For  brevity,  we  describe 
the  extension  only  for  the  case  we  need  later.  Let  F\  and  F2  be  two  (usually 
parametrically  specified)  survival  functions.  Consider  the  model 

G(t\P,z)  =  exp  [—  exp(/3jz)  {1  -  Fi}  -  exp(/?2 z)  {1  -  F2 }]  .  (1.14) 

Models  of  this  type  are  called  competing  risks  models,  since  F\  and  F2  may 
clearly  be  interpreted  as  the  c.d.f’s  for  competing  causes  of  failure.  If  we  stip¬ 
ulate  that  the  means  and  /x2  of  the  corresponding  c.d.f’s  Fi  and  F2  satisfy 
Ml  <  M2 7  then  6\  and  62  may  be  interpreted  in  terms  short-term  and  long-term 
effects,  respectively.  In  this  way  we  have  can  form  a  very  different  model  from 
the  PHPH  model  with  a  similar  interpretation.  Note  that  if  we  impose  the  pair 
of  restrictions  lim^oo  Fi  (£)  =  0,  and  lim^00F2(i)  =  0,  the  cumulative  hazard 
is  bounded,  so  that  (1.14)  is  in  the  class  of  models  (1.4),  with  6  =  0\  +  02. 

1.2.2  An  EM-based  estimation  procedure  for  the  PH  mixture 
model 

The  issue  of  potentially  unlimited  dimension  has  been  the  most  critical  de¬ 
terrent  to  the  use  of  maximum  likelihood  estimation  (MLE)  in  semiparamet- 
ric  regression  models.  Methods  based  on  partial  likelihood  are  specific  to  the 
proportional  hazards  model,  and  do  not  extend  to  other  models.  The  Newton- 
Raphson  procedure  requires  taking  inverse  of  the  information  matrix,  which  gets 
computationally  prohibitive  and  unstable  with  increasing  dimension.  Charac¬ 
terizing  the  future  directions  in  survival  analysis  in  their  recent  editorial  in 
Biometrics ,  Fleming  and  Lin  (2000)  pointed  out  that  “it  would  be  highly  useful 
to  develop  efficient  and  reliable  numerical  algorithms  for  the  semiparametric 
estimation...”. 

Recently,  Tsodikov  (2003)  generalized  the  EM  algorithm  for  the  frailty 
model  into  a  universal  ‘distribution- free’  procedure  applicable  to  the  NTM  class 
(1.7).  This  family  of  algorithms  (Quasi-EM)  is  a  subclass  of  the  so-called  MM 
algorithms  based  on  surrogate  objective  functions  (Lange,  Hunter,  and  Yang, 
2000).  Broadly  defined,  an  MM  algorithm  substitutes  a  computationally  sim¬ 
pler  surrogate  objective  function  for  the  target  function  on  each  step  of  the 
procedure  (similar  to  the  E-step  of  the  EM).  Maximizing  the  surrogate  objec¬ 
tive  function  drives  the  target  function  in  the  correct  direction.  Thus,  a  difficult 
maximization  procedure  is  replaced  by  a  simpler  one.  The  idea  of  the  QEM  ap¬ 
proach  is  to  obtain  a  surrogate  objective  function  that  does  not  depend  on  the 
entire  distribution  of  the  missing  data,  that  is,  the  derivation  of  a  ’distribution- 
free’  E-step  for  the  PH  mixture  model.  The  following  statement  is  the  key 
result. 

Proposition  1.2.1  (Tsodikov,  2003)  Let  r  be  the  observed  event  time,  and 
c  the  observed  censoring  indicator  (c  =  1  for  failures,  and  c  =  0  otherwise). 
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Under  the  PH  mixture  model 


and  if  Ff(t)  >  0,  the  conditional  expectation  of  M,  given  the  observed  event ,  is 
given  by 

c}  =  ©[F(r)|-,c], 


where 


0[x|*,  c]  =  c  +  x 


-y(c+l)  (a;|-) 

-y(c)  (ar|-) 


and  7W(x|-)  =  dlj(x\-)/dx\  i  —  1,2  and  7^(*I0  =  7(z|-)- 


The  above  statement  indicates  that  the  E-step  can  be  constructed  using 
the  first  two  derivatives  of  the  NTM-generating  function  7  without  any  knowl¬ 
edge  or  even  existence  of  the  mixing  random  variable  M.  Specification  of  the 
algorithm  for  a  particular  model  requires  evaluation  of  0  for  that  model. 

Cure  models  require  a  correction  of  ©  at  the  last  observation  (see  1.16) 
below).  Introduce  a  set  of  times  t i  =  1, . . . ,  n,  arranged  in  increasing  order, 
where  tn+\  =  00.  Associated  with  each  U  is  a  set  of  individuals  T>i  with  covari¬ 
ates  Zij,  j  €  Vi  who  fail  at  U,  and  a  similar  set  of  individuals  Ci  with  covariates 
zij ,  j  G  C{  who  are  censored  at  U. 

According  to  the  nonparametric  maximum  likelihood  method,  the  model 
can  be  fitted  by  maximizing  the  generalized  loglikelihood  i  with  respect  to 
the  regression  parameters  (3  and  unspecified  survival  function  F  (or  the  cor¬ 
responding  distribution  function  F).  An  argument  similar  to  that  adopted  by 
Kalbfleisch  and  Prentice  (1980)  can  be  used  to  verify  that  £  is  maximized  by  a 
step- function  F  with  steps  at  the  times  of  failures.  Let  r*,  k  =  1, . . . ,  K  be  the 
time  points  in  ascending  order  where  failures  occur  (V  is  not  empty).  Denote 
by  rk  the  rank  of  77  in  the  set  {£i}.  The  ranks  locate  the  points  77*  on  the 
net  {ti}:  r \  =  tTk.  By  definition  we  set  tk+  1  =  n  +  1  and  To  =  0.  For  any  A(t) 
let  Ai  -  A(U ). 

Differentiating  the  generalized  loglikelihood  with  respect  to  A — 
Hk~  1,  k  —  1, . . . ,  tk- 1,  we  obtain  the  score  equations  for  F  in  the  form 


A  Hk  = 


Dk 

Si  j  €72.fc  ®  ( -^  \0y  zij  ) 


(1.15) 


where  7 Zk  =  U t=k^i  is  the  risk  set  rib  Dk  is  the  number  of  failures  at 
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tfc,  and 


7(^ 


7'(*l|ft*y)  15 

7,(Frif_1|/?J2ij)  p 

TK-xiPMjW  mzijrTi<-" 

o, 


3  €  Ci,  i  <  -  1 

j  £T>i,i<  tk-  1 

j  G  *  —  r*r 
j  G  Ci,  i  >  r*r. 


(1.16) 


The  fact  that  the  contributions  of  the  observations  ij  to  the  likelihood  as  well 
as  ©  at,  or  after  the  last  failure  (i  >  tk),  are  different  from  their  counterparts 
before  the  last  failure  (i  <  vk  ~  1)  is  due  to  the  restriction  F{  =  0,  i  =  r#, . . . ,  n 
representing  the  fact  that  F  is  a  proper  survival  function.  The  above  distinction 
is  not  made  if  the  model  is  a  general  NTM  with  unrestricted  F. 

It  is  interesting  to  note  that  the  score  equations  have  the  form  of  the  Nelson- 
Aalen-Breslow  estimator  for  the  PH  model  with  the  usual  predictor  replaced 
by  0.  Given  /?,  iterations  with  respect  to  F  can  be  carried  out  as  follows: 

qE:  With  fcth-iteration  F^k\  compute  0^  for  each  subject. 

MA:  Update  F  using  the  Nelson-Aalen-Breslow  estimator  (1.15), 

It  can  be  shown  (Tsodikov,  2003)  that  if  0(x,  *)  is  a  nondecreasing  function  of 
x ,  or  if  7  is  a  PH  mixture  model  (a  stronger  assumption),  then  each  iteration 
described  above  improves  the  likelihood.  Also,  this  assumption  makes  the  above 
procedure  a  member  of  the  MM  family  (Lange,  Hunter,  and  Yang,  2000),  and 
the  convergence  properties  follow  from  the  general  MM  theory. 

There  are  many  ways  to  build  a  particular  model  fitting  algorithm  based  on 
the  principles  above,  and  this  depends  on  how  the  maximization  with  respect  to 
1 3  is  incorporated  into  the  procedure.  One  possibility  is  to  maximize  the  profile 
likelihood 

tpr(P)=t(P,F(p)l  (1.17) 

with  respect  to  /?,  where  F(/3)  is  determined  at  each  step  of  maximization  by 
iterating  the  steps  (qE)  and  (MA)  until  convergence.  This  method  was  used 
for  the  semiparametric  model  used  in  the  analysis  presented  below. 


1.3  Analysis  of  Data  from  the  Curie  Institute 

Two  bounded  cumulative  hazards  models,  the  double  proportional  hazards 
(1.13)  and  the  competing  risks  model  (1.14)  were  applied  to  data  from  women 
diagnosed  with  primary  breast  carcinoma  between  1981  and  1991  at  the  Curie 


Bounded  Cumulative  Hazard  Model 


9 


Institute.  The  endpoint  of  interest  was  recurrence  in  the  ipsilateral  breast. 
Both  models  admit  interpretation  in  terms  of  long  and  short  term  effects,  so 
sensitivity  to  model  specification  could  be  explored.  In  addition  we  used  three 
versions  of  the  PHPH  model,  differing  in  the  way  F  was  specified,  to  explore 
sensitivity  to  specification  of  F. 

The  general  treatment  policy  for  the  Curie  Institute  data  was  aimed  to¬ 
wards  breast  conservation  through  a  combined  application  of  radiotherapy  and 
limited  surgery.  A  detailed  description  of  subcohorts  of  these  patients  is  given 
in  Fourquet  et  al  (1989).  The  strategy  of  the  analysis  was  to  choose  important 
predictors  using  backward  elimination.  Women  with  missing  tumor  size,  grade, 
nodal  status,  estrogen  receptor  status,  progesterone  receptor  status,  or  treat¬ 
ment  information  were  excluded  from  the  analysis.  Women  for  whom  grade  was 
coded  as  'not  done’  were  included  however.  After  these  exclusions,  a  sample  of 
6232  women  out  of  an  initial  total  of  9899  was  available  for  analysis.  The  vast 
majority  (3278,  or  89%)  of  the  excluded  patients  had  either  missing  estrogen 
or  progesterone  receptor  status,  or  both. 

The  initial  model  from  which  backward  elimination  proceeded  included  both 
long-  and  short-term  effects  for  all  variables.  The  variables  included  were:  pri¬ 
mary  tumor  size  at  detection,  estrogen  and  progesterone  receptor  status,  clinical 
axillary  lymph  node  status  (nodal  involvement),  age,  and  tumor  size,  histologi¬ 
cal  grade,  and  treatment.  Estrogen  receptor  status  (ER),  progesterone  receptor 
status  (PgR)  and  nodal  involvement  were  taken  to  be  indicator  variables.  Treat¬ 
ment  was  taken  to  be  a  categorical  variable  with  four  levels:  radiotherapy  alone 
(RT),  mastectomy  alone,  tumorectomy  plus  radiotherapy,  and  mastectomy  plus 
radiotherapy.  Age  (in  years)  at  diagnosis  and  primary  tumor  size  (in  millime¬ 
ters)  were  considered  as  continuous  covariates  in  this  analysis.  Histological 
grading  (Scarff  Bloom)  was  taken  as  a  variable  with  three  categories  (I,  Ha  and 
lib  combined,  and  III).  Tumor  grade  was  highly  correlated  with  stage,  and,  in 
combination  with  the  other  covariates,  was  chosen  preferentially  over  stage  as 
predictive  of  local  recurrence  in  a  preliminary  analysis  with  all  covariates.  Each 
of  the  predictors  included  in  the  backward  elimination  was  highly  significant  in 
univariate  analysis. 

The  three  PHPH  models  differed  in  the  way  in  which  F  was  specified.  Our 
most  flexible  model  used  a  semiparametric  specification  of  F.  We  refer  to  this 
model  as  the  'semiparametric  PHPH  model’.  Our  intermediate  form  specified 
that  F  be  a  linear  spline.  For  our  spline  models  we  use  fixed,  equally  spaced 
knots  between  t  =  0  and  t  =  tmax ,  where  tmax  is  the  last  failure  time.  F  is 
parametrized  by  the  values  at  the  (ten)  knots,  subject  to  the  conditions  that 
F  is  monotone  decreasing,  F( 0)  =  1,  and  F(tmax)  =  0.  Linear  interpolation 
is  used  between  the  knots  to  ensure  continuity.  We  refer  to  this  model  as  the 
'spline  PHPH  model’.  The  most  restrictive  parametric  form  specified  that  F 
have  a  Weibull  distribution.  This  model  uses  only  two  parameters,  (a  median 
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and  shape  parameter),  to  specify  F.  We  refer  to  this  model  as  ‘Weibull  PHPH 
model5.  Weibull  distributions  were  also  used  for  both  and  F2  of  the  compet¬ 
ing  risks  model.  A  semiparametric  or  spline-based  form  of  the  competing  risks 
model  was  not  used,  as  it  is  difficult  to  retain  identifiability  with  such  a  model. 

Estimates  for  semiparametric  PHPH  model  were  obtained  by  maximizing 
the  profile  likelihood  (1.17)  using  the  Powell  algorithm  (Himmelblau,  1972),  as 
described  in  Section  1.2.2.  We  fit  the  parametric  and  spline  PHPH  models, 
as  well  as  the  competing  risks  model,  by  backward  elimination  as  well.  For 
these  models,  the  complete  likelihood  was  maximized  via  the  Powell  algorithm. 
Variables  with  several  levels  were  considered  in  blocks.  Table  1.1  gives  the 
significance  levels  of  variables  at  the  step  of  removal  or  inclusion  in  the  final 
model,  calculated  using  the  likelihood  ratio  test.  The  cycle  in  which  predictors 
were  removed  is  given  for  all  variables  not  in  the  final  model.  The  significance 
level  a  —  0.05  was  chosen  for  inclusion  in  the  final  model.  Although  predictors 
are  removed  in  a  different  order,  in  the  end  the  final  list  of  predictors  selected  for 
the  semiparametric  PHPH  and  Weibull-based  PHPH  models  were  exactly  the 
same.  One  more  short-term  predictor  (size)  was  selected  as  significant  for  for 
the  spline  PHPH  model.  The  competing  risks  model  contained  the  additional 
short-term  effect  of  age  (but  no  size  parameters).  We  note  that  in  each  case, 
there  are  highly  significant  short-term  effects,  so  that  the  proportional  hazards 
assumption  is  easily  rejected. 

Parameter  estimates  for  each  of  the  models  are  provided  in  Table  1.2.  The 
parametrization  for  the  PHPH  and  competing  risks  models  are  given  by  (1.13) 
and  (1.14)  respectively.  Note  that  the  cure  rate  estimates,  although  comparable 
to  each  other,  apply  to  the  extrapolated  ‘baseline5  subject  with  age  equal  to 
zero  and  tumor  size  equal  to  zero.  For  the  cure  rate  model,  the  median  for  one 
component,  interpreted  as  long  term,  was  123  years,  while  the  median  for  the 
other  component,  labeled  short  term,  was  23  years. 

The  models  were  generally  in  very  good  agreement.  For  all  three  PHPH 
models  and  the  competing  risks  model,  positive  progesterone  receptor  status 
and  older  age  all  significantly  increased  the  probability  of  ‘local  cure5,  while  lack 
of  nodal  involvement,  positive  estrogen  receptor  status,  and  low  histological 
grade  significantly  increased  the  mean  time  to  local  recurrence  for  uncured 
patients.  Interestingly,  tumor  size  was  chosen  as  (marginally)  significant  only 
for  the  spline  PHPH  model.  For  the  spline  PHPH  model,  there  was  a  significant 
decrease  in  the  mean  time  to  recurrence  with  increasing  primary  tumor  size.  For 
the  competing  risk  model,  older  age  had  a  significant  short  term  benefit  as  well 
as  a  long  term  benefit.  The  cure  rate  parameters  were  in  good  agreement  for  all 
the  PHPH  models,  but  the  cure  rate  parameter  for  the  competing  risks  model 
varied  somewhat,  perhaps  because  of  the  quite  different  model  specification. 

Treatment  had  both  a  highly  significant  long  term  and  short  term  effect. 
Compared  to  the  ‘baseline5  group  of  patients  receiving  radiotherapy  alone,  every 
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Effect 

Predictor 

Semi. 

PHPH 

Model 

Spline  Weibull 
PHPH  PHPH 

Comp. 

Risks 

Nodes 

1  (0.92) 

1  (0.78) 

2  (0.88) 

4  (0.18) 

PrR 

*  (0.02) 

*  (0.02) 

*  (0.02) 

*  (0.01) 

Long 

ER 

4  (0.23) 

6  (0.09) 

5  (0.20) 

2  (0.94) 

Term 

Histology 

5  (0.16) 

5  (0.14) 

6  (0.15) 

1  (0.90) 

Treatment 

*  (E-13) 

*  (E-12) 

*  (E-15) 

*  (E-13) 

Age 

*  (E-13) 

*  (E-13) 

*  (E-13) 

*  (IE-9) 

Size 

2  (0.86) 

3  (0.51) 

1  (0.99) 

3  (0.69) 

Nodes 

*  (3E-4) 

*  (2E-4) 

*  (IE-6) 

*  (7E-7) 

PgR 

3  (0.22) 

4  (0.19) 

4  (0.18) 

5  (0.14) 

Short 

ER 

*  (5E-9) 

*  (4E-9) 

*  (E-10) 

*  (IE-9) 

Term 

Histology 

*  (IE-8) 

*  (8E-9) 

*  (2E-9) 

*  (7E-9) 

Treatment 

*  (9E-5) 

*  (5E-5) 

*  (6E-7) 

*  (IE-6) 

Age 

6  (0.12) 

2  (0.58) 

3  (0.25) 

*  (8E-4) 

Size 

7  (0.10) 

*  (0.04) 

7  (0.07) 

6  (0.09) 

Table  1.1:  Step  at  which  predictors  were  removed,  and  significance  level  (in 
parentheses)  for  the  three  PHPH  models  and  the  competing  risks  model  fitted 
to  breast  cancer  data  using  backward  elimination.  Predictors  that  were  not 
removed  are  indicated  by  *.  The  significance  level  applies  either  to  the  removal 
step  (for  predictors  that  were  eliminated  by  backward  selection)  or  the  final 
model  (for  all  the  other  predictors). 
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other  group  of  patients  had  a  greater  chance  of  local  cure  for  all  the  models,  with 
those  receiving  mastectomy  (with  or  without  RT)  having  the  best  chance  of  local 
cure.  Curiously,  according  to  the  all  the  PHPH  models,  compared  to  patients 
receiving  only  radiotherapy,  patients  receiving  mastectomy  plus  radiotherapy 
or  mastectomy  alone,  and  who  have  not  been  locally  cured,  had  smaller  mean 
time  to  recurrence.  On  the  other  hand,  patients  who  received  lumpectomy  plus 
radiotherapy  were  less  likely  to  have  been  cured  locally  than  those  receiving 
mastectomy  or  mastectomy  plus  radiotherapy,  but  the  mean  time  to  recurrence 
was  increased.  For  the  competing  risk  model  mastectomy,  mastectomy  plus 
RT,  and  lumpectomy  +  RT  all  had  comparable  beneficial  effect  on  time  to 
recurrence  compared  to  RT  alone. 


1.4  Discussion 

Bounded  cumulative  hazards  regression  models  have  advantages  over  other 
models:  they  are  readily  interpretable  in  terms  of  characteristics  with  biomedi¬ 
cal  meaning,  and  they  have  more  flexibility  than  more  common  modeling  tools, 
and  can  be  fit  using  semiparametic  methods.  These  advantages  have  been  illus¬ 
trated  by  our  application  to  the  Curie  Institute  data,  presented  in  the  preceding 
section.  Even  though  our  results  appear  to  be  in  agreement  with  the  literature, 
our  approach,  which  includes  both  long  and  short  term  effects,  provides  more 
information  than  the  typical  proportional  hazards  modeling.  From  our  model 
we  can  see  that  estrogen  receptor  status,  nodal  involvement,  and  histological 
grade  appear  to  largely  affect  the  timing  of  local  recurrence,  rather  than  the 
probability  of  local  cure.  Age,  on  the  other  hand,  has  a  strong  effect  on  the 
probability  of  local  cure,  and  treatment  on  both  the  probability  of  local  cure  as 
well  as  the  timing  of  recurrence. 

The  results  of  our  application  of  the  PHPH  and  competing  risks  models 
to  the  Curie  Institute  data  suggest  that  the  long  and  short  terms  effects  are 
largely  robust  to  changes  in  model  specification  (from  PHPH  to  competing  risks 
model)  or  changes  in  specification  of  time  to  failure  distributions.  In  addition, 
our  results  appear  to  be  largely  consistent  with  the  breast  cancer  literature: 
decreased  probability  of  local  relapse  is  associated  with  older  age  at  onset, 
negative  nodes,  positive  estrogen  receptor  status,  and  low  histological  grade, 
particularly  after  breast  conserving  therapy.  (See,  for  example,  McReady  et  al. 
(1996),  who  analyzed  breast  cancer  treated  with  lumpectomy  alone).  Young 
age  is  often  regarded  as  an  important  risk  factor  not  only  for  local  recurrence, 
but  distant  metastasis  and  shorter  overall  survival  as  well  (Elkhuizen  et  al., 
1998).  Irradiation  of  the  breast  in  patients  receiving  lumpectomy  is  regarded 
as  reducing  risk  of  recurrence,  except,  perhaps,  for  patients  at  very  low  risk 
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Effect 

Predictor 

Level 

Semi. 

PHPH 

Model 

Spline  Weibull 
PHPH  PHPH 

Comp. 

Risks 

PgR 

Negative 

_ 

— 

— 

— 

Positive 

-0.16 

-0.16 

-0.16 

-0.25 

Treatment 

RT 

— 

— 

— 

— 

Long 

Mastectomy 

-1.29 

-1.27 

-1.28 

-1.26 

Term 

RT+Lump. 

-0.40 

-0.38 

-0.38 

-0.61 

RT+Mast. 

-1.57 

-1.57 

-1.53 

-1.55 

Age  (yrs.) 

(continuous) 

-0.023 

-0.023 

-0.023 

-0.026 

Cure  Rate 

0.064 

0.064 

0.068 

0.043 

Nodes 

Negative 

_ 

_ 

— 

— 

Positive 

0.47 

0.42 

0.50 

0.66 

ER 

Negative 

— 

— 

— 

— 

Positive 

-0.60 

-0.60 

-0.64 

-0.81 

Short 

Histology 

Not  done 

— 

— 

— 

— 

Term 

I 

-0.57 

-0.53 

-0.64 

-1.11 

II 

-0.17 

-0.13 

-0.20 

0.15 

III 

0.29 

0.34 

0.25 

0.44 

Treatment 

RT 

— 

— 

— 

— 

Mastectomy 

0.36 

0.39 

0.37 

-0.74 

RT+Lump. 

-0.47 

-0.39 

-0.52 

-0.87 

RT+Mast. 

0.67 

0.70 

0.63 

-0.90 

Age  (yrs.) 

(continuous) 

*** 

*** 

*** 

-0.020 

Size  (mm.) 

(continuous) 

*** 

0.0049 

*** 

*** 

Table  1.2:  Parameter  estimates  for  each  of  the  three  versions  of  the  PHPH 
model,  and  the  competing  risks  model,  fit  to  breast  cancer  recurrence  data 
from  the  Curie  Institute.  The  baseline  group  is  indicated  by  — .  Components 
that  are  absent  from  the  model  are  indicated  by  ***. 
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(Clark  et  al.,  1996). 

Association  of  small  tumor  size  with  decreased  risk  of  local  recurrence  has 
been  observed  in  the  literature.  (See,  for  example,  McGreedy,  1996,  or  Asselain 
et  al.,  1996).  We  do  not  see  much  effect  of  tumor  size,  as  size  was  absent  from 
two  out  of  three  of  our  final  PHPH  models,  was  only  marginally  significant 
in  the  short-term  component  of  the  other  models,  and  never  appeared  in  the 
long-term  component  of  the  models.  A  further  exploration  of  this  is  discussed 
below. 

A  tumor  recurrence  data  set  containing  a  subset  of  877  patients  from  the 
Curie  Institute  data  was  analyzed  by  Asselain  et  al.  (1996)  using  the  acceler¬ 
ated  failure  time  model  to  describe  short-term  covariate  effects.  Although  re¬ 
currence  in  both  the  ipsilateral  and  contralateral  breast  were  analyzed  together, 
the  authors  state  that  analysis  of  ipsilateral  recurrence  alone  gave  similar  re¬ 
sults.  The  model  incorporated  components  that  could  be  interpreted  in  terms  of 
long-term  effects  (mean  number  of  surviving  clonogens)  and  short-term  effects 
(time  of  progression).  A  more  limited  set  of  covariates  (age,  tumor  size,  nodal 
involvement,  and  treatment)  was  considered  in  this  analysis.  After  backward 
elimination,  treatment,  the  short-term  effect  of  tumor  size,  and  the  long-term 
effects  of  both  age  and  tumor  size.  In  accordance  with  the  analysis  presented 
in  the  previous  section,  age  appeared  to  have  a  pronounced  long-term  effect, 
but  little  effect  on  the  time  of  tumor  recurrence.  The  authors  indicate  that  the 
long-term  effect  of  age  appeared  to  be  the  predominant  one. 

The  apparent  lack  of  significance  of  tumor  size  in  the  current  analysis  ap¬ 
pears  to  contradict  the  results  of  the  Asselain  et  al.  (1996),  but  it  is  possible 
that  this  is  due  to  different  lists  of  predictors.  To  check  whether  the  inclusion 
of  receptor  status  in  the  current  analysis  is  responsible  for  the  discrepancy,  we 
removed  estrogen  and  progesterone  receptor  status  from  the  list  of  predictors 
and  fit  the  semiparametric  PHPH  model  to  the  data  by  backwards  elimina¬ 
tion.  There  is  a  significant  short-term  effect  of  tumor  size  when  this  is  done 
(p=0.026).  There  is,  in  addition,  a  significant  short-term  effect  of  age  (p  — 
0.027),  in  addition  to  the  long-term  effect  that  was  present  in  the  extended 
model.  The  other  selected  predictors  are  the  same.  We  see  that  even  after  a 
more  similar  list  of  predictors  are  included,  there  remains  some  slight  discrep¬ 
ancy  between  the  effects  of  tumor  size.  We  speculate  that  this  discrepancy  may 
be  caused  by  the  inclusion  of  more  detail  on  radiotherapy  in  the  earlier  analysis. 
Shorter  follow-up  analysis  and  smaller  sample  size  may  also  contribute  to  the 
discrepancy. 

Finally,  we  note  that  if  a  different  endpoint  is  analyzed,  such  as  cause- 
specific  survival  or  time  to  metastatic  spread,  the  effect  of  age  at  onset  may 
be  different.  For  example,  in  an  analysis  of  survival  in  women  diagnosed  with 
localized  breast  cancer  in  Utah  using  a  similar  model,  with  age  and  stage  as 
predictors,  Tsodikov  (2002)  found  that  the  long  term  effect  of  age  at  diagnosis 
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was  not  significant,  while  the  short  term  effect  remained  significant,  at  least  for 
younger  patients. 


Acknowledgment 

This  research  was  supported  by  NIA/NIH  Grant  ROl  AG14650-01 A2,  NCI/NIH 
U01  CA88177-01,  NIH/NCI  1  UOl  CA97414-01  and  a  Department  of  Defense 
grant  DAMD17-03-1-0034. 


References 

1.  Asselain,  B.,  Fourquet,  A.,  Hoang,  T.,  Tsodikov,  A.D.  and  Yakovlev, 
A.Y.  (1996).  A  parametric  regression  model  of  tumor  recurrence:  An 
application  to  the  analysis  of  clinical  breast  cancer  data,  Statistics  & 
Probability  Letters ,  29,  271-278. 

2.  Berkson,  J.  and  Gage,  R.P.  (1952).  Survival  curves  for  cancer  patients 
following  treatment,  J.  Amer.  Statist  Ass.,  47,  501-515. 

3.  Boag,  J.M.  (1949).  Maximum  likelihood  estimates  of  the  proportion  of 
patients  cured  by  cancer  therapy,  J.  Roy.  Statist.  Soc.}  Ser.  B ,  11,  15-44. 

4.  Broet,  P.,  Rycke,  Y.D.,  Tubert-Bitter,  R,  Lellouch,  J.,  Asselain,  B.,  and 
Moreau,  T.  (2001).  A  semi-parametric  approach  for  the  two-sample  com¬ 
parison  of  survival  times  with  long-term  survivors,  Biometrics ,  57,  844- 
852. 

5.  Cantor,  A.B.  and  Shuster,  J.J.  (1992),  Parametric  versus  nonparamet- 
ric  methods  for  estimating  cure  rates  based  on  censored  survival  data, 
Statistics  in  Medicine ,  11,  931-937. 

6.  Cheng,  S.,  L.  Wei,  and  Z.  Ying  (1995).  Analysis  of  transformation  models 
with  censored  data,  Biometrika, 82,  835-845. 

7.  Clark,  R.M.,  Whelan,  T.,  Levin,  M.,  Roberts,  R.,  Willan,  A.,  McCul¬ 
loch,  P.,  Lipa,  M.,  Wilkinson,  R.H.,  and  Mahoney,  L.J.  (1996).  Random¬ 
ized  clinical  trial  of  breast  irradiation  following  lumpectomy  and  axillary 
dissection  for  node-negative  breast  cancer:  an  update.  Ontario  Clinical 
Oncology  Group,  Journal  Natl  Cancer  Institute ,  88(22),  1659-64. 

8.  Cox,  D.  R.  (1972),  Regression  Models  and  Life  Tables,  (with  discussion), 
Journal  of  the  Royal  Statistical  Society ,  Series  B ,  34,  187-220. 


16 


Boucher  et  a  1. 


9.  Elkhuizen,  P.H.,  van  de  Vijver,  M.J.,  Hermans,  J.,  Zonderland,  H.M., 
van  de  Velde,  C .J.,  and  Leer,  J.W.  (1998).  Local  recurrence  after  breast- 
conserving  therapy  for  breast  cancer:  high  incidence  in  young  patients 
and  association  with  poor  survival,  Int.  J.  Rad.  Oncol  Biol  Phys , 
40(4)  859-67. 

10.  Etezadi-Amoli,  L.  and  Ciampi,  A.  (1987).  Extended  hazard  regression 
for  censored  survival  data  with  covariates:  A  spline  approximation  for 
the  baseline  hazard  function,  Biometrics  43,  181-192. 

11.  Fleming,  T.  and  Lin,  D.  (2000).  Survival  analysis  in  clinical  trials:  Past 
developments  and  future  directions,  Biometrics  56,  971-983. 

12.  Fourquet,  A.F.,  Campana,  F.,  Zafrani,  V,  Mosseri,  P.,  Vielh,  P.,  Durand, 
J.-C.,  and  Vilcoq,  J.R.  (1989).  Prognostic  factors  of  breast  recurrence  in 
the  conservative  management  of  early  breast  cancer:  A  25  year  follow-up. 
Int.  J.  Radiat.  One.  Biol  Phys.  17,  719-725. 

13.  Haybittle,  J.L.  (1959).  The  estimation  of  the  proportion  of  patients  cured 
after  treatment  for  cancer  of  the  breast.  Br.  J.  Radiol. ,  32,  725-733. 

14.  Haybittle,  J.L.  (1965).  A  two-parameter  model  for  the  survival  curve  of 
treated  cancer  patients.  J.  Amer.  Statist.  Assoc. ,  60,  16-26. 

15.  Himmelblau,  D.M.  (1972).  Applied  Nonlinear  Programming ,  McGraw- 
Hill,  Austin. 

16.  Kalbfleisch,  J.D.  and  Prentice,  R.L.  (1980).  The  Statistical  Analysis  of 
Failure  Time  Data.  John  Wiley  &  Sons,  New  York. 

17.  Kuk,  A.Y.C.,  and  Chen,  C.-H.  (1992).  A  mixture  model  combining  lo¬ 
gistic  regression  with  proportional  hazards  regression,  Biometrika ,  79, 
531-541. 

18.  Lange,  K.  and  Hunter,  D.R.,  and  Yang,  1.(2000).  Optimization  transfer 
using  surrogate  objective  functions  (with  discussion),  Journal  of  Compu¬ 
tational  and  Graphical  Statistics ,  9  1-59. 

19.  McCready,  D.R.,  Hanna,  W.,  Kahn,  H.,  Chapman,  J.A.,  Wall,  J.,  Fish, 
E.B.,  and  Lickley,  H.L.  (1996).  Factors  associated  with  local  breast  cancer 
recurrence  after  lumpectomy  alone,  Ann.  Surg.  Oncol  3(4),  358-66. 

20.  Mailer,  R.A.  and  Zhou,  S.  (1996).  Survival  Analysis  with  Long-Term 
Survivors ,  Wiley,  Chichester. 

21.  Miller,  R.G.  (1981).  Survival  Analysis ,  John  Wiley,  New  York. 


Bounded  Cumulative  Hazard  Model 


17 


22.  Nielsen,  G.,  Gill,  R.,  Andersen,  P.  and  Sorensen,  T.  (1992).  A  counting 
process  approach  to  maximum  likelihood  estimation  in  frailty  models, 
Scand.  J.  Statist .,  19,  25-43. 

23.  Peng,  Y.,  and  Dear,  K.B.G.  (2000).  A  nonparametric  mixture  model  for 
cure  rate  estimation,  Biometrics ,  56,  237-243. 

24.  Sy,  J.P,  and  Taylor,  J.M.G.  (2000).  Estimation  in  a  Cox  proportional 
hazards  cure  model,  Biometrics ,  56,  227-236. 

25.  Taylor,  J.M.G.  (1995).  Semi-parametric  estimation  in  failure  time  mix¬ 
ture  models,  Biometrics ,  51,  899-907. 

26.  Tsodikov,  A.  (1998).  Asymptotic  efficiency  of  a  proportional  hazards 
model  with  cure.  Stat.  Probab.  Letters ,  39,  1998,  237-244. 

27.  Tsodikov,  A.  (2002).  Semiparametric  models  of  long-  and  short-term 
survival:  An  application  to  the  analysis  of  breast  cancer  survival  in  Utah 
by  age  and  stage,  Statistics  in  Medicine ,  21,  895-920. 

28.  Tsodikov,  A.  (2003).  Semiparametric  models:  A  generalized  self-consistency 
approach.  J.  Roy.  Statist.  Soc.,  Ser.  B ,  in  press. 

29.  Tsodikov,  A.D.,  Ibrahim,  J.G.,  and  Yakovlev,  A.Y.  (2003).  Estimating 
Cure  Rates  from  Survival  Data:  An  Alternative  to  Two-Component  Mix¬ 
ture  Models,  J.  Amer.  Statist.  Assoc.,  in  press. 

30.  Yakovlev,  A. Yu.  (1994).  Letter  to  the  Editor.  Statistics  in  Medicine  13, 
983-986. 

31.  Yakovlev,  A.Y.,  Asselain,  B.,  Bardou,  V.J.,  Fourquet,  A.,  Hoang,  T., 
Rochefordiere,  A.,  and  Tsodikov,  A.D.  (1993).  A  simple  stochastic  model 
of  tumor  recurrence  and  its  application  to  data  on  premenopausal  breast 
cancer,  In  Biometrie  et  Analyse  de  Donnees  Spatio-Temporelles  No  12, 
(Eds.  B.  Asselain,  M.  Boniface,  C.  Duby,  C.  Lopez,  J.P.  Masson,  and  J. 
Tranchefort),  Societe  Frangaise  de  Biometrie,  ENSA  Rennes,  France,  pp. 
66-82. 


